Comparing Apples to Apple: The Effects of Stemmers on Topic Models

Rule-based stemmers such as the Porter stemmer are frequently used to preprocess English corpora for topic modeling. In this work, we train and evaluate topic models on a variety of corpora using several stemming algorithms. We examine several quantitative measures of the resulting models, including likelihood, coherence, model stability, and entropy. Although stemmers are widely used in topic modeling, we find that they produce no meaningful improvement in likelihood or coherence and can in fact degrade topic stability.
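As a rough illustration of what rule-based stemming does to a vocabulary before topic modeling, the sketch below applies a toy suffix stripper to a handful of tokens. The rule list, the `toy_stem` function, and the sample tokens are all hypothetical and chosen only for illustration; this is not the Porter algorithm or any stemmer evaluated in this work.

```python
from collections import Counter

# Hypothetical toy rules, NOT the Porter algorithm: the first matching
# suffix is stripped, provided enough of the word remains.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if the remainder is long enough."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token


# Illustrative tokens only; a real corpus would be tokenized documents.
tokens = "apples apple stemming stems stemmed topic topics".split()

raw_vocab = Counter(tokens)                           # word types before stemming
stemmed_vocab = Counter(toy_stem(t) for t in tokens)  # word types after stemming

print(len(raw_vocab), "types before stemming")
print(len(stemmed_vocab), "types after stemming")
print(dict(stemmed_vocab))
```

Here "apples" and "apple" collapse into one type, which is the kind of conflation a stemmer is meant to provide. The crude rules also leave "stems" and "stemming" under different stems; established stemmers include extra rules for cases like doubled consonants.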
