Unification and the Computational Analysis of Arabic

1. Introduction

The field of computational linguistics possesses some remarkable lacunae. A great deal of work has been devoted to the efficient, reasoned parsing of syntax; as a result all but a very few syntactic theories have been at least partially implemented in an attempt to arrive at this goal. Morphological analysis has conversely been deemphasized, in part because of the prevailing emphasis on syntax in linguistics, but mainly because the vast majority of work in natural language processing has been done by English-speakers on English, a language which has the interesting and relatively rare peculiarity of having very little morphology. In an analytic language like English, it is a perfectly feasible option, given the memory capacities and the power of modern computational machinery, simply to list all possible forms of a word, and allow the machine to access these forms directly in the lexicon, as if they were unanalyzable. Even when morphological analysis is incorporated as part of an English system, only a few rules are needed to handle almost all inflectional variants of an English word. Systems with a grand total of six rules (one to handle -s noun plurals, one to handle the third singular present suffix -s, one to handle -ing participials, one to handle the past tense -ed, one to handle the comparative -er, and one to handle the superlative -est) can account for all regular morphology in the language.

The instant one goes beyond English, however, efficient morphological analysis becomes vitally important. In a language such as Russian, or even Spanish, the number of possible forms becomes so large that it is no longer reasonable to list them. They have to be analyzed from the surface form to a form which can be looked up in a lexicon, typically by means of rules which match partial word-patterns against candidate strings, strip off affixes, and modify stems to conform to their canonical dictionary entries.
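The rule-based procedure just described (match a partial word-pattern, strip the affix, adjust the stem, look the result up) can be sketched for the six regular English rules. This is a minimal illustration only, not an implementation from any system discussed here: the toy lexicon, the category labels, the feature names, and the two stem repairs (restoring a dropped final e, undoing consonant doubling) are all assumptions made for the example.

```python
# Hypothetical toy lexicon: canonical dictionary forms with a category label.
LEXICON = {"cat": "noun", "walk": "verb", "bake": "verb", "stop": "verb", "tall": "adj"}

# The six regular rules named in the text: (suffix, category of the stem, feature).
# The feature labels are illustrative names, not taken from the source.
RULES = [
    ("s", "noun", "plural"),
    ("s", "verb", "3sg-present"),
    ("ing", "verb", "participle"),
    ("ed", "verb", "past"),
    ("er", "adj", "comparative"),
    ("est", "adj", "superlative"),
]


def candidate_stems(remainder):
    """Yield possible dictionary forms for what is left after stripping a suffix."""
    yield remainder                      # walk+ed  -> walk
    yield remainder + "e"                # bak+ed   -> bake (restore a dropped final e)
    if len(remainder) >= 2 and remainder[-1] == remainder[-2]:
        yield remainder[:-1]             # stopp+ed -> stop (undo consonant doubling)


def analyze(word):
    """Return every (stem, category, feature) analysis licensed by lexicon and rules."""
    analyses = []
    if word in LEXICON:
        analyses.append((word, LEXICON[word], "base"))
    for suffix, category, feature in RULES:
        if not word.endswith(suffix) or len(word) <= len(suffix):
            continue
        remainder = word[: -len(suffix)]
        for stem in candidate_stems(remainder):
            if LEXICON.get(stem) == category:
                analyses.append((stem, category, feature))
                break                    # first matching stem wins for this rule
    return analyses


if __name__ == "__main__":
    for form in ["cats", "walks", "baked", "stopped", "tallest"]:
        print(form, analyze(form))
```

Even this small sketch shows why the approach scales poorly beyond English: each additional affix or stem alternation adds another rule or repair, and every one of them must agree with the lexicon on category and feature names.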
It is not surprising that the more innovative recent proposals regarding morphological processing have come from speakers of languages other than English, most prominently the Finn Koskenniemi (1983), who uses parallel rules and mini-lexica that are in essence the continuation classes for each type of rule operation. Unfortunately, both pattern matchers and orthodox versions of Koskenniemi's two-level system suffer from three major problems. First, they exhibit an extreme dichotomy between morphology and syntax. The morphological component exists only to provide data to the syntactic component, data which are used independently of the morphological component. Changes to the syntactic component are quite independent of changes to the morphology, and any alterations made in one have to be made laboriously in the other. Obviously, input from the morphological component must be acceptable to the syntax. In evolving systems with very large morphological rule bases, ensuring consistency between syntax and morphology becomes a real problem. If a system at one stage utilizes a feature such as '3sgm', for example, and then at a later stage splits this feature into two features, '3sg' and 'm', so that the syntax can access these two features separately, this change will have to be made painstakingly in every rule which referred to the older feature.
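The maintenance problem in the '3sgm' example can be made concrete with a small sketch. The rule base below is hypothetical: the rule names and the other feature labels are invented for illustration, and only '3sgm', '3sg', and 'm' come from the text. It assumes each morphological rule hard-codes its features as atomic strings, which is the situation the paragraph describes.

```python
# Hypothetical rule base: each rule emits atomic feature labels as plain strings.
MORPH_RULES = {
    "rule_a": ["3sgm", "past"],
    "rule_b": ["3sgm", "present"],
    "rule_c": ["1pl", "past"],
}


def rules_mentioning(rules, feature):
    """List every rule that needs an edit if `feature` is redefined."""
    return sorted(name for name, feats in rules.items() if feature in feats)


def split_feature(rules, old, new_parts):
    """Return a copy of the rule base with `old` replaced by `new_parts` in every rule."""
    updated = {}
    for name, feats in rules.items():
        expanded = []
        for feat in feats:
            # Replace the old atomic label wherever it occurs; keep the rest as is.
            expanded.extend(new_parts if feat == old else [feat])
        updated[name] = expanded
    return updated


if __name__ == "__main__":
    print(rules_mentioning(MORPH_RULES, "3sgm"))
    print(split_feature(MORPH_RULES, "3sgm", ["3sg", "m"]))
```

In a toy rule base the rewrite is a single pass, but the number of affected rules grows with the size of the morphology, and nothing in the representation itself guarantees that the syntactic component was updated to expect '3sg' and 'm' at the same time.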