YASS: Yet another suffix stripper

Stemmers attempt to reduce a word to its stem or root form and are used widely in information retrieval tasks to increase the recall rate. Most popular stemmers encode a large number of language-specific rules built over a length of time. Such stemmers with comprehensive rules are available only for a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing tools have been successfully used to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures are defined, and the lexicon for a given text collection is clustered using the distance measures to identify these equivalence classes. The proposed approach is compared with Porter's and Lovin's stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of Porter's and Lovin's stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.

[1]  Carol Peters Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation , 2000 .

[2]  Fredric C. Gey,et al.  Cross language information retrieval: a research roadmap , 2002, SIGF.

[3]  John A. Goldsmith,et al.  Unsupervised Learning of the Morphology of a Natural Language , 2001, CL.

[4]  Chris Buckley,et al.  Using Query Zoning and Correlation Within SMART: TREC 5 , 1996, TREC.

[5]  Nicola Ferro,et al.  A probabilistic model for stemmer generation , 2005, Inf. Process. Manag..

[6]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[7]  George W. Adamson,et al.  The use of an association measure based on character structure to identify semantically related pairs of words and document titles , 1974, Inf. Storage Retr..

[8]  Julie Beth Lovins,et al.  Development of a stemming algorithm , 1968, Mech. Transl. Comput. Linguistics.

[9]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[10]  Yiming Yang,et al.  Unsupervised Learning of Arabic Stemming Using a Parallel Corpus , 2003, ACL.

[11]  Robert Krovetz,et al.  Viewing morphology as an inference process , 1993, Artif. Intell..

[12]  Derrick Higgins,et al.  Automatic Language-Specific Stemming in Information Retrieval , 2000, CLEF.

[13]  Antonio Zamora,et al.  Automatic spelling correction in scientific and scholarly text , 1984, CACM.

[14]  J. Hsu Multiple Comparisons: Theory and Methods , 1996 .

[15]  Anne N. De Roeck,et al.  A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots , 2000, ACL.

[16]  Chris Buckley,et al.  New Retrieval Approaches Using SMART: TREC 4 , 1995, TREC.

[17]  Douglas W. Oard,et al.  CLEF Experiments at Maryland: Statistical Stemming and Backoff Translation , 2000, CLEF.

[18]  Lisa Ballesteros,et al.  Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis , 2002, SIGIR '02.

[19]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[20]  Ananthakrishnan Ramanathan,et al.  A Lightweight Stemmer for Hindi , 2003 .

[21]  W. Bruce Croft,et al.  Corpus-based stemming using cooccurrence of word variants , 1998, TOIS.

[22]  Robert M. Hayes The SMART retrieval system; experiments in automatic document processing: Edited by Gerard Salton, Prentice-Hall, Englewood Cliffs, New Jersey, 1971. 556 pages , 1973 .

[23]  Gerard Salton,et al.  The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .