Computational nonlinear morphology with emphasis on Semitic languages

Computational morphology would be an almost trivial exercise if every language were like English. In English, chopp-ing off the occasion-al affix-es, of which there are not too many, is sufficient to isolate the stem, perhaps modulo a few (morpho)graphemic rules to handle phenomena like the consonant doubling we just saw in chopping. This relative ease with which one can identify the core meaning component of a word explains the success of rather simple stemming algorithms for English, as well as the way in which most part-of-speech (POS) taggers get away with examining only bounded initial and final substrings of unknown words to guess their parts of speech. In contrast, this book outlines a computational approach to morphology that explicitly includes languages from the Semitic family, in particular Arabic and Syriac, where the linearity hypothesis, namely that every word can be built via string concatenation of its component morphemes, seems to break down (we will take up the validity of that assumption below). Example 1 illustrates the problem at hand with Syriac verb forms of the root {q₁ṭ₂l₃} ‘notion of killing’ (from Kiraz [1996]).
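The contrast between the two kinds of morphology can be made concrete with a minimal sketch. It is not taken from the book under review: the suffix list, the function names `strip_suffix` and `interdigitate`, and the `C`-slot template notation are all assumptions made for illustration, and the Arabic root k-t-b ‘notion of writing’ is the standard textbook case rather than the Syriac data of Example 1. The stripper is deliberately crude and will mis-stem words such as passing.

```python
# Toy contrast: linear suffix stripping vs. root-and-pattern interdigitation.
# Everything here is hypothetical illustration, not the book's formalism.

SUFFIXES = ("ing", "es", "al", "ed", "s")  # assumed toy suffix inventory
VOWELS = "aeiou"


def strip_suffix(word):
    """Chop one known suffix, then undo consonant doubling (chopp -> chop)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Crude (morpho)graphemic rule: collapse a doubled final consonant.
            if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    return word  # no known suffix: leave the word untouched


def interdigitate(root, template):
    """Fill each 'C' slot of the template with the next root consonant."""
    consonants = iter(root)
    out = []
    for ch in template:
        out.append(next(consonants) if ch == "C" else ch)
    return "".join(out)


if __name__ == "__main__":
    for w in ("chopping", "occasional", "affixes"):
        print(w, "->", strip_suffix(w))
    # The same root surfaces discontinuously inside different vowel patterns.
    for t in ("CaCaC", "CuCiC"):
        print("ktb +", t, "->", interdigitate("ktb", t))
```

Under these assumptions, `strip_suffix` recovers chop, occasion, and affix, whereas no amount of prefix or suffix chopping turns katab or kutib back into k-t-b: the root consonants are interleaved with the vowels rather than concatenated with them.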