Paper information - An Algorithmic Approach to English Pluralization

An Algorithmic Approach to English Pluralization

This paper discusses some of the issues involved in designing robust and comprehensive algorithms which convert singular English nouns, verbs and adjectives to their appropriate plural forms. Four such algorithms are given: one for each part of speech which inflects in the plural, and a unified algorithm for all such parts of speech. A word comparison algorithm that can identify words that differ only in their grammatical number is also given. Finally, an overview is given of a full implementation of the various algorithms in the Perl [1] programming language.

The problem of English plurals

The English language is overburdened with idiosyncratic grammatical features, a legacy of its eclectic accretion over 1500 years [2,3]. One unfortunate consequence of this otherwise admirable richness is that automatically generating correct English is fraught with difficulty. Composing the simplest of sentences may require quite sophisticated semantic understanding to enable the correct syntax to be chosen. Even at the lexical level it can be a complex matter to correctly inflect the individual words of a sentence to reflect their number, person, mood, case, etc.

The use of English plurals in synthetic sentences is a case in point. In computing applications, for example, it is quite common to encounter error messages which jar because they do not correctly inflect for grammatical number:

    Compilation aborted: 1 errors were found

Individually, such inelegances are easily overcome (or, more accurately, the inelegance may be transferred from the interface to the code):

    print "Compilation aborted: $count ",
          ($count==1 ? "error was" : "errors were"),
          " detected.\n";

Unfortunately, in attempting to generate more complex text, some less tractable problems arise, notably the diversity of plural forms available in English. Consider the difficulty faced by a text generation system (machine or human) in forming plural versions of the following:

    Her criterion differs from mine.
    Analysis of this aquarium's fish failed to determine its genus.
    That phalanx suffered a trauma.

This paper presents an algorithmic approach that provides (nearly) automatic plural inflections for such examples.

Coping with English plurals in synthetic text

Existing techniques for dealing with plural inflections in generated text fall into four categories: indifference, evasion, explication, and automation. The following sections briefly describe each of these.

Ignoring the problem

Ignoring issues of pluralization has a long and glorious history in certain synthetic text generation contexts. Typically, when this approach is used, the programmer simply assumes that the number required will always be non-singular and that any cases where a singular does appear will be written off by the user as a "computer glitch" or tolerated as a flaw in the interface. Hence the familiar "There were 1 errors" message.

One might argue that this approach is economically rational, in that the extra cost and complexity involved in identifying and coding around that one special case outweighs the benefit of correctly handling it. This, of course, is the perennial excuse for ugly and ungainly interfaces, and quite unassailable in the estimation of the utilitarian mind.

Avoiding the problem

English is sufficiently flexible that programmers, faced with the task of generating text of a changeable number, may easily enough recast their synthetic prose into "number-inclusive" forms. The simplest approach is to structure the text so that the grammatical number of the various parts of speech in a sentence is fixed, regardless of the actual number of items being referred to. Hence:

    Number of errors: 1
    Number of errors: 10

A common (if somewhat clumsy) alternative is to bet both ways and structure the sentence so that it will read correctly in either grammatical number:

    1 error(s) found.
    10 error(s) found.
Evasion techniques such as these solve the problem of "canned" synthetic text, but do so either by craving the readers' indulgence (of threadbare English) or their complicity (in ignoring the inappropriate sense of a schizophrenic construction). However, in general text generation, such terse and artificial structures may be inappropriate or simply unachievable.

A "manual" scheme

One variation on the "each-way bet" approach is for the programmer to explicitly provide both singular and plural forms and then have the system select the correct form according to the actual number required. For example, consider a subroutine:

    sub select_pl($$) {
        my ($word, $count) = @_;
        # Replace each "(singular/plural)" group with the form that matches $count
        $word =~ s{\(([^)/]*)/([^)]*)\)}{$count==1 ? $1 : $2}ge;
        return $word;
    }

which allows the programmer to code synthetic text generation as follows:

    print $count, select_pl(" error(/s) w(as/ere)", $count), " found\n";

This approach neatly solves the problem of correctly inflecting "canned" text for number, but is not easily adapted to handle the more general problems encountered when the text is not pre-determined.

Pluralizing algorithms

The simplest algorithm for generating arbitrary English plurals is simply to add -s to each word (clam → clams, storey → storeys, bag → bags, etc.). Of course, this approach fails miserably on many special cases (class → classes, story → stories, box → boxes), and on the hundreds of irregular plural English nouns (criterion → criteria, stigma → stigmata, ox → oxen). Nor does it cater for verbs (classifies → classify, stores → store, bobs → bob) or adjectives (my → our, her → their, Bob's → Bobs').

More complex algorithms that cope with specific suffixes (-ss → -sses, -y → -ies, etc.) can be specified, but pure suffix-based approaches will still be prone to exceptions and meta-exceptions. For example: -y becomes -ies, except after a vowel (when it becomes -ys), except for soliloquy (which takes -ies).
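The interplay of defaults, suffix rules, and exceptions described above can be illustrated with a minimal Perl sketch for nouns only. It is not the paper's algorithm: the subroutine name plural_noun_sketch, the tiny exception table, and the three suffix rules are assumptions chosen to cover just the examples quoted in this section.

```perl
use strict;
use warnings;

# Specific exceptional cases (an illustrative subset, not a complete list).
my %irregular = (
    criterion => 'criteria',
    stigma    => 'stigmata',
    ox        => 'oxen',
    soliloquy => 'soliloquies',    # meta-exception to the "vowel + y" rule
);

# General suffix rules, tried in order; the first matching pattern wins.
my @suffix_rules = (
    [ qr/(?:ss|x)$/, sub { $_[0] . 'es' } ],                   # class -> classes
    [ qr/[aeiou]y$/, sub { $_[0] . 's' } ],                    # storey -> storeys
    [ qr/y$/,        sub { substr($_[0], 0, -1) . 'ies' } ],   # story -> stories
);

# Hypothetical helper: exceptions first, then suffix rules, then the default.
sub plural_noun_sketch {
    my ($word) = @_;
    return $irregular{$word} if exists $irregular{$word};
    for my $rule (@suffix_rules) {
        my ($pattern, $inflect) = @$rule;
        return $inflect->($word) if $word =~ $pattern;
    }
    return $word . 's';    # universal default: append -s
}

my @words = qw(clam storey class story box soliloquy ox);
print join(', ', map { plural_noun_sketch($_) } @words), "\n";
# prints: clams, storeys, classes, stories, boxes, soliloquies, oxen
```

The lookup order matters: ox must be caught by the exception table before the -x rule can turn it into "oxes", and soliloquy must be caught before the vowel-plus-y rule can produce "soliloquys".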
A usable pluralization algorithm must therefore cope with three categories of plural formation: universal defaults, general suffix-based rules, and specific exceptional cases. The following section examines each of these categories in more detail.

Categories of English plurals

Universal rules

Although described here first, and encountered most frequently, the universal rules of plural inflection are the "last resort" in an algorithmic sense. That is, these rules only apply when all other more specific rules or special cases (see below) are inapplicable. The rules themselves are well-known and need no elaboration. By default:

• Nouns are made plural by appending -s.
• Verbs are made plural by removing any trailing -s (and otherwise do not change).
• Adjectives and adverbs do not change when made plural.

Suffix categories

There are, however, an enormous number of exceptions to these defaults [4]. Most such exceptions are still regular (in the sense that they occur in predictable patterns), but are specific to a particular word suffix. For example, nouns that end in -ss universally become -sses in the plural (and vice versa for verbs). Likewise, nouns which end in a consonant followed by -y almost always become -ies in the plural.

Certain types of adjectives also inflect in this way. For example, possessive adjectives that end in -'s or -' in the singular are made plural by forming the plural of the root word and appending an apostrophe (unless the root's plural does not itself end in -s, in which case -'s is appended). Hence cat's becomes cats', axis' becomes axes', whilst child's becomes children's.

Other suffix categories arise because words of foreign origin (most commonly Ancient Greek or Latin) have retained a non-anglicized plural inflection. Hence criterion becomes criteria, nucleus becomes nuclei, and matrix becomes matrices. Dealing with such categories is complicated by the fact that many other imports have been wholly or partially anglicized.
Hence although criterion always forms its plural with -a, ganglion may take either -s or -a (ganglions or ganglia), whilst bastion is always inflected with -s. Occasionally the anglicized and "classical" plural forms of a word may both be in common use, but with distinct meanings. Thus a copy editor might remove appendices, whereas a surgeon would remove appendixes.

The correct inflection of words derived from Latin can be particularly complex, since the same suffix may form different Latinate plurals depending on the declension (or sometimes the part of speech) of the original.

Singular suffix  Anglicized plural  Classical plural  Example
-a               (none)             -ae               alga → algae
-a               -as                -ae               nova → novas/novae
-a               -as                -ata              dogma → dogmas/dogmata
-an              -en                (none)            woman → women
-ch              -ches              (none)            church → churches
-eau             -eaus              -eaux             chateau → chateaus/chateaux
-en              -ens               -ina              foramen → foramens/foramina
-ex              (none)             -ices             codex → codices
-ex              -exes              -ices             index → indexes/indices
-f(e)            -ves               (none)            life → lives
-ieu             -ieus              -ieux             milieu → milieus/milieux
-is              (none)             -es               basis → bases
-is              -ises              -ides             iris → irises/irides
-ix              -ixes              -ices             matrix → matrixes/matrices
-nx              -nxes              -nges             phalanx → phalanxes/phalanges
-o               -oes               (none)            potato → potatoes
-o               -os                (none)            photo → photos
-o               (none)             -i                graffito → graffiti
-o               -os                -i                tempo → tempos/tempi
-on              (none)             -a                aphelion → aphelia
-on              -ons               -a                ganglion → ganglions/ganglia
-oo-             -ee-               (none)            foot → feet
-oof             -oofs              -ooves            hoof → hoofs/hooves
-s               -s                 (none)            series → series
-s               -ses               (none)            atlas → atlases
-sh              -shes              (none)            wish → wishes
-um              (none)             -a                bacterium → bacteria
-um              -ums               -a                medium → mediums/media
-us              (none)             -era              genus → genera
-us              (none)             -i                stimulus → stimuli
-us              -uses              -era              opus → opuses/opera
-us              -uses              -i                radius → radiuses/radii
-us              -uses              -ora              corpus → corpuses/corpora
-us              -uses              -us               status → statuses/status
-x               -xes               (none)            box → boxes
-y               -ies               (none)            ferry → ferries

Table 1: Major English suffix categories.
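Because words sharing a suffix can fall into different rows of Table 1, a suffix on its own cannot select the plural; each category also needs a list of member words. The Perl sketch below encodes a few rows this way. The subroutine name plural_by_category, the $classical flag, and the short member lists are hypothetical and stand in for the much larger word lists a real implementation would need.

```perl
use strict;
use warnings;

# A few rows of Table 1:
# [ singular suffix, anglicized plural, classical plural, member words ].
# undef marks a "(none)" cell. Member lists are illustrative only.
my @categories = (
    [ 'on', undef,  'a',    [qw(criterion aphelion)] ],
    [ 'on', 'ons',  'a',    [qw(ganglion)] ],
    [ 'us', undef,  'i',    [qw(stimulus)] ],
    [ 'us', undef,  'era',  [qw(genus)] ],
    [ 'us', 'uses', 'i',    [qw(radius)] ],
    [ 'us', 'uses', 'us',   [qw(status)] ],
    [ 'ix', 'ixes', 'ices', [qw(matrix)] ],
);

# Hypothetical helper: $classical selects the classical form where one exists.
sub plural_by_category {
    my ($word, $classical) = @_;
    for my $category (@categories) {
        my ($singular, $anglicized, $classic, $members) = @$category;
        next unless grep { $_ eq $word } @$members;
        # Prefer the requested style, falling back to the only form available.
        my $suffix = $classical ? ($classic // $anglicized)
                                : ($anglicized // $classic);
        (my $stem = $word) =~ s/\Q$singular\E$//;
        return $stem . $suffix;
    }
    return $word . 's';    # not in any listed category: universal default
}

for my $word (qw(criterion ganglion bastion)) {
    printf "%s -> %s / %s\n", $word,
        plural_by_category($word, 0), plural_by_category($word, 1);
}
# prints:
# criterion -> criteria / criteria
# ganglion -> ganglions / ganglia
# bastion -> bastions / bastions
```

Keeping the member lists beside each suffix triple means that two rows with the same singular suffix, such as stimulus and genus, are distinguished by membership and not by pattern matching.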
Thus the plural of stimulus (second declension) is stimuli, and that of genus (third declension) is genera. Status (fourth declension) is traditionally unchanged in the plural, whilst ignoramus (a first person plural Latin verb) has been wholly

Damian M Conway

[1] A. J. Thomson. A Practical English Grammar, Fourth Edition, 1986.

[2] R. MacNeil, et al. The Story of English, 1986.