Paper information - Stemming of French words based on grammatical categories

Stemming of French words based on grammatical categories

Automatic indexing systems use suffix stripping algorithms to cluster various words derived from a common root under the same stem. Currently, removing affixes is either a context-free or a context-sensitive operation, where the context refers to the remaining stem. In this article, we propose a suffixing algorithm which uses grammatical categories to enhance the stemming process. This approach supports the use of foreign languages. In our case, the language is French, and a morphological analysis is required for removing inflectional suffixes or morphosyntactic variants of a lemma. After this analysis, we implement a suffix stripping algorithm which uses a dictionary and the grammatical categories to remove derivational suffixes. Our approach always returns a linguistically correct lemma, but not necessarily the “right” one. Based on our tests, this solution is an attractive one, with a mean error rate of 16%. We finish by explaining why we cannot expect significantly better results with this approach.
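The second stage described in the abstract (stripping derivational suffixes with the help of a dictionary and grammatical categories) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `Rule` type, the three rules, the toy `LEXICON`, and the category tags (`ADV`, `ADJ`, `NOUN`, `VERB`) are all assumptions chosen for illustration, and the input is assumed to be a lemma that a prior morphological analysis has already produced and tagged.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical derivational rule, not taken from the paper."""
    suffix: str       # suffix to strip from the lemma
    category: str     # grammatical category the lemma must have
    replacement: str  # string appended after stripping
    target: str       # category the resulting word must have


# Toy dictionary (assumption): word -> set of grammatical categories.
LEXICON = {
    "rapide": {"ADJ"},
    "rapidement": {"ADV"},
    "chanter": {"VERB"},
    "chanteur": {"NOUN"},
    "former": {"VERB"},
    "formation": {"NOUN"},
    "jument": {"NOUN"},
}

# Illustrative rules (assumption): adverb -> adjective, noun -> verb.
RULES = [
    Rule("ment", "ADV", "", "ADJ"),
    Rule("eur", "NOUN", "er", "VERB"),
    Rule("ation", "NOUN", "er", "VERB"),
]


def strip_derivational_suffix(lemma: str, category: str) -> str:
    """Return a dictionary-attested base form, or the lemma unchanged."""
    for rule in RULES:
        # The grammatical category gates the rule before any stripping.
        if category != rule.category or not lemma.endswith(rule.suffix):
            continue
        candidate = lemma[: len(lemma) - len(rule.suffix)] + rule.replacement
        # Accept only candidates listed in the dictionary with the target category.
        if rule.target in LEXICON.get(candidate, set()):
            return candidate
    return lemma
```

In this sketch, `strip_derivational_suffix("rapidement", "ADV")` yields `rapide`, while the noun `jument` is left untouched even though it ends in `ment`, because the rule applies only to adverbs. The dictionary check means every returned form is an attested word, which echoes the abstract's point that the result is a linguistically correct lemma without being guaranteed to be the intended one.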

Jacques Savoy

[1] Gerard Salton, et al. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, 1989.

[2] Christiane Laeufer, et al. Le Bon Usage, 1986.

[3] Julie B. Lovins. Error evaluation for stemming algorithms as clustering algorithms, 1971.

[4] Yaacov Choueka, et al. Disambiguation by short contexts, 1985, Comput. Humanit.

[5] Donna Harman, et al. How effective is suffixing, 1991.

[6] Chris D. Paice, et al. Another stemmer, 1990, SIGF.

[7] Jacques Savoy, et al. Bayesian Inference Networks and Spreading Activation in Hypertext Systems, 1992, Inf. Process. Manag.

[8] Christopher J. Fox, et al. A stop list for general text, 1989, SIGF.

[9] C. D. Paice. Information retrieval and the computer, 1977.

[10] Julie Beth Lovins, et al. Development of a stemming algorithm, 1968, Mech. Transl. Comput. Linguistics.