Stemming via Distribution-Based Word Segregation for Classification and Retrieval

A novel corpus-based method for stemmer refinement, which can improve both classification and retrieval, is described. The method models the given words as generated from a multinomial distribution over the topics available in the corpus, and includes a procedure, similar to sequential hypothesis testing, that groups together distributionally similar words. The system can refine any stemmer, and its strength can be controlled with parameters that set how much tolerance is allowed when computing the similarity between the distributions of two words. Although obtaining the morphological roots of the given words is not the primary objective, the algorithm does so automatically to some extent. Despite a large reduction in dictionary size, classification accuracies improve significantly when the proposed system is applied to some existing stemmers for classifying the 20 Newsgroups and WebKB data. The refinements obtained are also suitable for cross-corpus stemming. For retrieval, the superiority of the method is extensively demonstrated with respect to four existing methods.
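The general idea of refining a stemmer by distributional similarity can be illustrated with a minimal sketch. It is not the procedure of the paper: the sequential hypothesis test is replaced here by a plain total variation distance threshold, and the names `refine_stem_groups`, `tolerance`, the toy prefix stemmer, and the example counts are all hypothetical. Each word carries a vector of per-topic occurrence counts, and words that share a stem stay grouped only if their normalized topic distributions are close.

```python
from collections import defaultdict


def topic_distribution(counts):
    """Normalize per-topic occurrence counts into a probability vector."""
    total = sum(counts)
    if total == 0:
        raise ValueError("word has no occurrences")
    return [c / total for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def refine_stem_groups(word_topic_counts, stemmer, tolerance=0.2):
    """Split each stem group into subgroups of distributionally similar words.

    word_topic_counts: dict mapping word -> list of counts, one per topic.
    stemmer: callable mapping word -> stem (any existing stemmer).
    tolerance: hypothetical knob; larger values merge more aggressively.
    NOTE: a distance threshold stands in for the paper's hypothesis test.
    """
    by_stem = defaultdict(list)
    for word in sorted(word_topic_counts):
        by_stem[stemmer(word)].append(word)

    refined = []
    for stem in sorted(by_stem):
        clusters = []  # each cluster keeps its words and pooled topic counts
        for word in by_stem[stem]:
            counts = word_topic_counts[word]
            p = topic_distribution(counts)
            for cluster in clusters:
                q = topic_distribution(cluster["counts"])
                if total_variation(p, q) <= tolerance:
                    cluster["words"].append(word)
                    cluster["counts"] = [
                        a + b for a, b in zip(cluster["counts"], counts)
                    ]
                    break
            else:
                # No existing cluster is close enough: start a new one.
                clusters.append({"words": [word], "counts": list(counts)})
        refined.extend(c["words"] for c in clusters)
    return refined


# Invented counts over two topics: [military, science].
example_counts = {
    "general": [40, 10],
    "generals": [35, 5],
    "generation": [5, 45],
    "generations": [4, 36],
}


def toy_stemmer(word):
    """Deliberately crude stemmer that over-conflates: keep five letters."""
    return word[:5]


groups = refine_stem_groups(example_counts, toy_stemmer, tolerance=0.2)
print(groups)
# [['general', 'generals'], ['generation', 'generations']]
```

In this toy run the crude stemmer maps all four words to the same stem, and the refinement separates them into two groups because their topic distributions differ sharply. Raising `tolerance` to 1.0 would keep all four together, which mirrors the role the abstract gives to the tolerance parameters.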
