A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics

Abstract: Word stemming is a widely used mechanism in the fields of Natural Language Processing, Information Retrieval, and Language Modeling. Language-independent stemmers discover classes of morphologically related words from the ambient corpus without using any language-specific rules. In this article, we propose a fully unsupervised, language-independent text stemming technique that clusters morphologically related words from a corpus of the language using both lexical and co-occurrence features: lexical similarity, suffix knowledge, and co-occurrence similarity. The method applies to a wide range of inflectional languages because it identifies morphological variants formed through different linguistic processes such as affixation, compounding, and conversion. The proposed approach has been tested in an information retrieval application for four languages (English, Marathi, Hungarian, and Bengali) using standard TREC, CLEF, and FIRE test collections. It achieves a significant improvement over word-based retrieval, five other corpus-based stemmers, and rule-based stemmers in all four languages. Besides information retrieval, the proposed approach has also been tested in text classification and inflection removal tasks, where it outperformed the other baseline methods in all test scenarios. Thus, we achieved the objective of developing a multipurpose stemming algorithm that can be used not only for information retrieval but also for non-traditional tasks such as text classification, sentiment analysis, and inflection removal.
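To make the clustering idea concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the article. It assumes a shared-prefix ratio as the lexical similarity, a Jaccard overlap of document sets as the co-occurrence similarity, and hypothetical threshold values; the suffix-knowledge feature is omitted entirely. All function names and parameters are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def common_prefix_len(a, b):
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lexical_similarity(a, b):
    """Assumed measure: shared prefix length divided by the longer word's length."""
    return common_prefix_len(a, b) / max(len(a), len(b))


def cooccurrence_similarity(a, b, docs_of):
    """Assumed measure: Jaccard overlap of the sets of documents containing each word."""
    union = docs_of[a] | docs_of[b]
    return len(docs_of[a] & docs_of[b]) / len(union) if union else 0.0


def cluster_variants(documents, min_prefix=3, lex_threshold=0.5, cooc_threshold=0.1):
    """Group words that look alike and also appear in overlapping documents.

    The thresholds are hypothetical defaults chosen for this toy example.
    """
    # Map each word to the set of document indices it occurs in.
    docs_of = defaultdict(set)
    for i, doc in enumerate(documents):
        for word in doc.lower().split():
            docs_of[word].add(i)

    words = sorted(docs_of)
    parent = {w: w for w in words}  # union-find forest over the vocabulary

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    # Merge every pair that passes both the lexical and the co-occurrence test.
    for a, b in combinations(words, 2):
        if common_prefix_len(a, b) < min_prefix:
            continue
        if (lexical_similarity(a, b) >= lex_threshold
                and cooccurrence_similarity(a, b, docs_of) >= cooc_threshold):
            parent[find(b)] = find(a)

    clusters = defaultdict(list)
    for w in words:
        clusters[find(w)].append(w)
    return [sorted(c) for c in clusters.values()]


documents = [
    "the connection connected users",
    "connecting a connection",
    "connected and connecting nodes",
    "the cat sat",
]
clusters = cluster_variants(documents)
```

In this toy corpus, "connected", "connecting", and "connection" share a long prefix and overlapping documents, so they fall into one class, while unrelated words such as "cat" remain singletons. The pairwise loop is quadratic in vocabulary size, which is acceptable for a sketch but would need blocking (for example, by prefix) on a real corpus.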
