Paper Information - FindStem: Analysis and Evaluation of a Turkish Stemming Algorithm

In this paper, we evaluate the effectiveness of a new stemming algorithm, FINDSTEM, for use with Turkish documents and queries, and compare it with two previously defined Turkish stemmers, namely the "A-F" and "L-M" algorithms. Of these, the FINDSTEM and A-F algorithms perform both inflectional and derivational stemming, whereas the L-M algorithm handles only inflectional rules. The stemming algorithms were compared manually on 5,000 distinct words, of which FINDSTEM, A-F, and L-M failed on 49, 270, and 559, respectively. To evaluate the retrieval effectiveness of FINDSTEM, we used a medium-size collection consisting of 2,468 law records with 280K document words, 15 natural-language queries with an average length of 17 search words, and complete relevance information for each query. We localized the SMART retrieval system by adding a stop list, support for Turkish characters, i.e., the ISO 8859-9 (Latin-5) code set, a stemming algorithm (FINDSTEM), and a Turkish translation at the message level. Our results, based on average precision values at 11-point recall levels, show that indexing document terms as well as search terms with FINDSTEM stemming is clearly and consistently more effective than indexing the terms as they are (that is, with no stemming at all).
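The abstract reports effectiveness as average precision over the 11 standard recall levels. The paper's own evaluation code is not shown here, so the following is only a minimal sketch, assuming the commonly used interpolated definition (precision at recall level r is the highest precision observed at any recall of r or more). The function names and the toy document ids are hypothetical and do not come from the paper or from SMART.

```python
from typing import List, Sequence, Set

# The 11 standard recall levels: 0.0, 0.1, ..., 1.0
RECALL_LEVELS = [i / 10 for i in range(11)]


def eleven_point_precision(ranking: Sequence[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at the 11 standard recall levels for one query.

    ranking  -- document ids in the order the system returned them
    relevant -- ids judged relevant for the query (assumed non-empty)
    """
    if not relevant:
        raise ValueError("query has no relevant documents")

    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolation: best precision at any recall >= the level, else 0.0.
    return [
        max((p for rec, p in points if rec >= level - 1e-12), default=0.0)
        for level in RECALL_LEVELS
    ]


def eleven_point_average(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the 11 interpolated precision values for one query."""
    values = eleven_point_precision(ranking, relevant)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy query: two relevant documents, retrieved at ranks 1 and 3.
    print(eleven_point_average(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

Under this assumption, a stemmed and an unstemmed run would each be scored per query in this way, and the per-query values averaged over the query set before comparison.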

Hayri Sever | Yiltan Bitirim
