A Novel Method for Stemmer Generation Based on Hidden Markov Models

In this paper, we present a method based on Hidden Markov Models (HMMs) to generate statistical stemmers. Using a list of words as a training set, the method estimates the HMM parameters, which are then used to calculate the most probable stem for an arbitrary word. Stemming is performed by computing the most probable path through the HMM states corresponding to the input word. Neither linguistic knowledge nor a training set of manually stemmed words is required. We describe the method and the results of experiments carried out using standard test collections for five different languages.
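To make the idea of stemming as a most-probable-path computation concrete, the following is a minimal sketch rather than the implementation evaluated in the paper. It assumes a toy two-state HMM (the state names `STEM` and `SUFFIX`, the function `viterbi_split`, and all probabilities are hypothetical and hand-set for illustration), in which letters are the emitted symbols and the stem is taken to end where the best path first enters a suffix state. The parameter estimation step, which the method performs from a plain word list, is omitted here.

```python
import math
import string

NEG_INF = float("-inf")

def viterbi_split(word, states, suffix_states, log_init, log_trans, log_emit):
    """Split `word` into (stem, suffix) along its most probable state path."""
    if not word:
        return "", ""
    # best[s]: log-probability of the best path that ends in state s
    best = {s: log_init.get(s, NEG_INF) + log_emit[s].get(word[0], NEG_INF)
            for s in states}
    back = []  # one dict of back-pointers per letter after the first
    for ch in word[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda p: best[p] + log_trans[p].get(s, NEG_INF))
            new_best[s] = (best[prev] + log_trans[prev].get(s, NEG_INF)
                           + log_emit[s].get(ch, NEG_INF))
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the most probable final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    # The stem ends where the path first enters a suffix state.
    split = next((i for i, s in enumerate(path) if s in suffix_states),
                 len(word))
    return word[:split], word[split:]

# Hand-set toy parameters (assumptions, not values estimated from data).
states = ["STEM", "SUFFIX"]
suffix_states = {"SUFFIX"}
log_init = {"STEM": math.log(1.0)}  # every word starts in the stem
log_trans = {
    "STEM": {"STEM": math.log(0.8), "SUFFIX": math.log(0.2)},
    "SUFFIX": {"SUFFIX": math.log(1.0)},  # no way back into the stem
}
log_emit = {
    "STEM": {c: math.log(1 / 26) for c in string.ascii_lowercase},
    "SUFFIX": {c: math.log(0.25) for c in "ings"},
}

for w in ["walking", "cats", "book"]:
    print(w, viterbi_split(w, states, suffix_states,
                           log_init, log_trans, log_emit))
```

With these toy parameters, "walking" is split into "walk" and "ing", "cats" into "cat" and "s", and "book" is left whole. The same toy model would overstem a word such as "sing", which illustrates why the parameters need to be estimated from a realistic word list rather than set by hand.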
