3. Methodology

This project is being conducted in two phases. In Phase I (nearly complete), a database of all nouns listed in the Standard Swahili-English Dictionary (Johnson 1939, henceforth SSED) is being compiled using a commercial database program, dBASE IV. Phase II will investigate contemporary usage of the noun class system in connected discourse. The two phases are described in the next two subsections.

3.1 The noun database

As mentioned above, all nouns from the SSED are being entered into a database[6]. So far 4784 nouns have been entered, giving relatively complete coverage of all classes[7]. Each noun in the database is subcategorized according to a combination of morphological and semantic criteria. The morphological criteria are:

(a) noun class affiliation, using the traditional Bantu numbering system;
(b) if the noun is derived, the source of the derivation (e.g. verb stem, adjective stem, etc.).

Most of the categories of the database are semantic, since semantic analysis is the purpose of the enterprise. For each noun I have recorded its dictionary definition and classified its meaning according to several semantic categories, each constituting a separate `field' of the database. The major categories are HUMAN, ANIMAL, PLANT, SHAPE, SIZE, AFFECT, FORCE OF NATURE (such as wind, rain, etc.), and NUMBER[8]. Within each of these categories further, more specific information is provided. For example, within the field HUMAN the nouns are further subclassified into agentive (denoting the agent of an action), kinship term, religious (e.g. `prophet', `saint'), occupation (e.g. `tailor'), etc.
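The record structure described above can be pictured roughly as follows. This is a minimal sketch in Python rather than dBASE IV; the class name, the field names, and the three toy records are assumptions made for illustration and are not taken from the actual database.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NounRecord:
    """One noun entry; all field names here are hypothetical."""
    form: str                           # dictionary headword
    noun_class: int                     # traditional Bantu class number
    definition: str                     # dictionary definition (abbreviated here)
    derived_from: Optional[str] = None  # e.g. "verb stem"; None if underived or not recorded
    # Major semantic category -> more specific subclassification.
    semantic: Dict[str, str] = field(default_factory=dict)


# Toy records for illustration only; not drawn from the SSED database.
SAMPLE = [
    NounRecord("mshonaji", 1, "tailor", "verb stem", {"HUMAN": "occupation"}),
    NounRecord("mtume", 1, "prophet", semantic={"HUMAN": "religious"}),
    NounRecord("simba", 9, "lion", semantic={"ANIMAL": "unspecified"}),
]

# Example lookup: all nouns subclassified as occupations within HUMAN.
occupations = [r.form for r in SAMPLE if r.semantic.get("HUMAN") == "occupation"]
print(occupations)
```

Keeping each major category as a separate key mirrors the one-field-per-category layout, so a noun can carry values in several categories at once.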

Use of a database has both advantages and disadvantages. The obvious advantage is that the database makes it possible to store and manipulate very large amounts of data, and to sort it in any number of different ways. Thus a few keystrokes can generate a list of all the nouns in Class 5, all nouns referring to animals, all large three-dimensional hollow objects in Class 9, etc. The database can also be used for other purposes besides those for which it was originally designed. A dictionary is a kind of `culture inventory': just looking at which semantic areas are highly differentiated and which are not yields insight into the interests and preoccupations of the speakers of a language.
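The kind of retrieval just described might look like the following sketch, again in Python rather than dBASE IV. The records, the `select` helper, and the tag names attached to each noun are illustrative assumptions only. The final tally of tags is a crude version of the `culture inventory' idea: it shows which semantic areas are most heavily populated.

```python
from collections import Counter

# Toy records for illustration only; class numbers and tags are assumptions.
NOUNS = [
    {"form": "jicho", "noun_class": 5, "gloss": "eye", "tags": set()},
    {"form": "kaa", "noun_class": 5, "gloss": "charcoal", "tags": set()},
    {"form": "simba", "noun_class": 9, "gloss": "lion", "tags": {"ANIMAL"}},
    {"form": "kaa", "noun_class": 9, "gloss": "crab", "tags": {"ANIMAL"}},
    {"form": "mtu", "noun_class": 1, "gloss": "person", "tags": {"HUMAN"}},
]


def select(nouns, noun_class=None, tag=None):
    """Return the records that match every criterion that is not None."""
    return [
        n for n in nouns
        if (noun_class is None or n["noun_class"] == noun_class)
        and (tag is None or tag in n["tags"])
    ]


class_5 = [n["gloss"] for n in select(NOUNS, noun_class=5)]
animals = [n["gloss"] for n in select(NOUNS, tag="ANIMAL")]
class_9_animals = [n["gloss"] for n in select(NOUNS, noun_class=9, tag="ANIMAL")]

# How many nouns fall under each major semantic category.
tag_counts = Counter(tag for n in NOUNS for tag in n["tags"])
print(class_5, animals, class_9_animals, tag_counts.most_common())
```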

Disadvantages of this research method are both practical and theoretical. On the practical side, entering all the nouns from a dictionary into a database is obviously very time-consuming. But one could also regard this as an advantage: reading the dictionary does allow (or force) one to become intimately familiar with the data. A second practical problem is how to avoid entering redundant records. For example, the SSED sometimes lists derived nouns both as separate entries and as sub-entries under the source of the derivation. This problem was avoided by writing a program for dBASE IV that automatically scanned the database for homographic entries each time a new noun was entered and displayed all previous examples of the relevant form; new homographs were entered only in cases of homonymy.
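The duplicate check can be sketched as follows. This is not the original dBASE IV program, only a Python approximation of the procedure as described: look up existing records with the same spelling, then admit the new record only if it is judged to be a distinct word. The function names are hypothetical, and the `is_homonym` callback stands in for the researcher's own judgment after inspecting the displayed matches.

```python
def find_homographs(db, form):
    """Return all existing records spelled exactly like `form`."""
    return [r for r in db if r["form"] == form]


def enter_noun(db, record, is_homonym):
    """Add `record` unless it merely repeats an existing homograph.

    `is_homonym(record, matches)` is a stand-in for the human decision
    made after looking at the previously entered matches.
    """
    matches = find_homographs(db, record["form"])
    if matches and not is_homonym(record, matches):
        return False  # redundant entry, skipped
    db.append(record)
    return True


def differs_in_gloss(record, matches):
    # Crude stand-in for human judgment: a new gloss counts as a new word.
    return all(record["gloss"] != m["gloss"] for m in matches)


# Toy data for illustration only.
db = []
results = [
    enter_noun(db, {"form": "kaa", "noun_class": 9, "gloss": "crab"}, differs_in_gloss),
    enter_noun(db, {"form": "kaa", "noun_class": 9, "gloss": "crab"}, differs_in_gloss),
    enter_noun(db, {"form": "kaa", "noun_class": 5, "gloss": "charcoal"}, differs_in_gloss),
]
print(results, len(db))
```

In this toy run the repeated entry is rejected, while the second, distinct word with the same spelling is admitted as a homonym.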

The database project also raises some theoretical issues. The first and most important is the problem of the semantic categories used to tag the nouns. In order to create a database, one has to anticipate which classificatory categories will be useful before entering the data, in a way guessing at the very analysis that the tool is intended to help discover. Use of a bilingual dictionary potentially adds to this problem, by introducing (or imposing) semantic categories of English that may or may not be relevant to Swahili. How can I be sure that I am not just projecting English-based categories onto Swahili? The short answer is, of course, that there is no general way to guard against this. It is the familiar problem of working from the `etic' to the `emic' (in the terminology of Pike 1967). In practice, I tried to minimize the problem by drawing on previous work on noun classification in Bantu languages, especially Denny and Creider (1976), Zawawi (1979), and Spitulnik (1987, 1989), as well as cross-linguistic studies of noun categorization, for example Adams and Conklin (1973) and Craig (1986).

Even so, I found it necessary to modify the database in various ways as I went along. Several tags were added or replaced during the data entry. Of course, modifying the tags for previously entered records is very laborious, but this procedure does permit a kind of dialogic relationship between the data and the tagging process.

A second problem, or set of problems, comes from the use of a dictionary as a data source. A dictionary is an analysis, not just a description. The compilers make choices about which words to include, how many entries to make for a given form (the familiar problems of polysemy and homonymy), and how to deal with geographic and social variation in pronunciation, grammar, meaning, and usage. Without doing extensive archival research, the user of the dictionary has no way of knowing exactly whose language is represented in it. Moreover, a dictionary is a sociolinguistic act (cf. Hymes 1974 and, even more to the point, Fabian 1986). It is produced by people with a certain socio-cultural background, for a certain intended audience, and with certain goals in mind. The SSED, for example, was compiled by a British colonial committee constituted in 1930, composed in part of Christian missionaries, and was intended for an English-speaking colonial audience (for details, see Whiteley 1969). I intend no criticism of the compilers, who supply a wealth of cultural information about the vocabulary, but no dictionary can be free of ethnocentric bias. It is not hard to find obvious examples, e.g. several varieties of fish defined only as `fish, not considered good eating by Europeans'. Although examples like this are not typical of the dictionary as a whole, it is still inevitable that in defining a Swahili term for an English-speaking audience of European cultural background, the points regarded as worthy of comment or elaboration would be those where there is a perceived contrast between Swahili language and culture and those of English-speaking Europeans. In any case, the only way of compensating for ethnocentrism in the dictionary is to learn as much as possible about the language and culture from other sources as well.

Another important problem with the SSED as a data source derives from the compilers' goal of standardizing the language. Because dictionaries usually have a prescriptive function as well as a descriptive one, it is hard to determine how much variation in the language is being concealed in order to encourage uniformity of usage. For example, from my knowledge of the contemporary language, I expected to find a fair amount of variation in noun class assignment, especially between Classes 5 and 9, both of which contain large numbers of loanwords, and both of which have zero as the most frequent allomorph of the noun class prefix. However, only about 3% of the nouns in the database were listed in the dictionary as variable in noun class membership. Has the situation changed since the 1930s, or were the dictionary makers trying to impose conformity on the data? There is no way to tell. Also, even for those nouns that are listed as variable, no information is given about the nature of the variation: does it stem from dialect variation, from variation among individual speakers, or from variation based on discourse context? Again, there is no way to tell.

For the reasons just outlined, it is desirable to supplement the dictionary material with a wider range of data, especially data from contemporary discourse. Discourse data is important because discourse is the place to look for the areas of uncertainty and variability in meaning and usage that are represented only sporadically in the SSED. It is also the place to find neologisms, loanwords, slang, and other innovative usages that may or may not make it into a dictionary. Looking at how these `uncodified' words interact with the noun class/agreement system should shed light on the semantic reality of the noun classes themselves, and on the semantic functions of the agreement system. This is the plan for Phase II of the project.

3.2 Investigation of noun classes in discourse

For Phase II of the project, I plan to use the electronic corpus of Swahili texts that is currently being compiled by the Department of Asian and African Studies of the University of Helsinki, Finland, together with the Institute for Kiswahili Research of the University of Dar es Salaam, Tanzania. The corpus, housed at Helsinki, contains prose texts in Standard Swahili, from books and newspapers, and transcriptions of folkloristic material. The texts have not been coded for morphological and syntactic information, but some text-retrieval programs are available, which produce concordances with context ranging from a line to a sentence in length[9].
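To make concrete what a concordance with sentence-length context involves, here is a minimal sketch in Python. It does not reproduce the Helsinki retrieval programs, whose interfaces are not described here; the function names and the three-sentence toy text are assumptions. Because the corpus carries no morphological coding, the sketch matches exact wordforms only.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def concordance(text, wordform):
    """Return every sentence containing `wordform` as a whole token.

    Matching ignores case but is otherwise exact: inflected or derived
    forms of the same stem are not retrieved.
    """
    target = wordform.lower()
    hits = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"\w+", sentence.lower())
        if target in tokens:
            hits.append(sentence)
    return hits


# Toy text for illustration only; not taken from the corpus.
TEXT = "Mtoto anasoma kitabu. Kitabu hiki ni kizuri. Simba amelala."
kitabu_hits = concordance(TEXT, "kitabu")
print(kitabu_hits)
```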

I already have a list of all wordforms in the Helsinki corpus that are not included in the dictionary compiled by the Institute for Kiswahili Research (Taasisi 1981), with their sentential contexts, for which I profusely thank Arvi Hurskainen. This list allows a look at the syntactic behavior of nouns that have not yet been `codified'. A preliminary scan of the data has already uncovered some interesting examples of agreement with conjoined noun phrases, treatment of acronyms, and nouns with variable agreement patterns, which will be the subject of a later paper.
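A list of this kind could in principle be derived as in the following sketch, which collects every wordform absent from a reference list together with the sentences in which it occurs. This is an illustration only and not the procedure actually used to produce the list; the function name, the toy sentences, and the toy list of known forms are assumptions. A naive comparison like this one ignores inflection, so a real comparison against dictionary headwords would need some morphological analysis first.

```python
import re
from collections import defaultdict


def uncodified_wordforms(sentences, known_forms):
    """Map each wordform not in `known_forms` to its sentential contexts."""
    known = {form.lower() for form in known_forms}
    contexts = defaultdict(list)
    for sentence in sentences:
        for token in re.findall(r"\w+", sentence):
            # Compare case-insensitively, but keep the original spelling
            # as the key so that acronyms remain recognizable.
            if token.lower() not in known and sentence not in contexts[token]:
                contexts[token].append(sentence)
    return dict(contexts)


# Toy data for illustration only.
SENTENCES = ["UKIMWI ni ugonjwa hatari.", "Mtoto ana kitabu."]
KNOWN_FORMS = ["ni", "ugonjwa", "hatari", "mtoto", "ana", "kitabu"]

missing = uncodified_wordforms(SENTENCES, KNOWN_FORMS)
print(missing)
```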
