An Unsupervised Method to Improve Spanish Stemmer

We evaluate the effectiveness of using our edit distance algorithm to improve an unsupervised, language-independent stemming method. The main idea is to create morphological families by automatically grouping words with our distance, and then to perform stemming based on that grouping. We evaluated two things: the capacity of the edit distance algorithm for word clustering, and the ability of our method to generate the correct stem for Spanish. The method achieved 98% precision in creating morphological families and 99.85% correct stemming.
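The grouping-then-stemming idea can be illustrated with a minimal sketch. The abstract does not specify the authors' edit distance, so the code below substitutes the standard Levenshtein distance, normalized by the length of the longer word. The greedy clustering, the 0.5 threshold, the use of the longest common prefix as the stem, and all function names are assumptions made for illustration only, not the authors' method.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein distance (insert, delete, substitute all cost 1).

    Stand-in for the authors' own distance, which the abstract does not define.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def group_families(words: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Greedy grouping (an assumption): each word joins the first family
    whose first member is within `threshold`, else starts a new family."""
    families: List[List[str]] = []
    for word in words:
        for family in families:
            if normalized_distance(word, family[0]) <= threshold:
                family.append(word)
                break
        else:
            families.append([word])
    return families


def common_prefix(words: List[str]) -> str:
    """Longest prefix shared by every word in a non-empty list."""
    prefix = words[0]
    for w in words[1:]:
        while not w.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def stem_table(words: List[str], threshold: float = 0.5) -> Dict[str, str]:
    """Map each word to its family's common prefix (a hypothetical stem rule)."""
    table: Dict[str, str] = {}
    for family in group_families(words, threshold):
        stem = common_prefix(family)
        for w in family:
            table[w] = stem
    return table


if __name__ == "__main__":
    sample = ["cantar", "cantamos", "cantaba", "libro", "libros", "librito"]
    for word, stem in stem_table(sample).items():
        print(f"{word} -> {stem}")
```

On the small Spanish sample in the script, the forms of "cantar" and the forms of "libro" fall into separate families. A real evaluation would depend on the actual distance, threshold, and stem-selection rule used in the paper.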
