A universal information theoretic approach to the identification of stopwords

One of the most widely used approaches in natural language processing and information retrieval is the so-called bag-of-words model. A common component of such methods is the removal of uninformative words, commonly referred to as stopwords. Currently, most practitioners use manually curated stopword lists. This approach is problematic because it cannot be readily generalized across knowledge domains or languages. Because stopwords are difficult to define rigorously, there have been few systematic studies of the effect of stopword removal on algorithm performance, which is reflected in the ongoing debate over whether to keep or remove them. Here we address this challenge by formulating an information theoretic framework that automatically identifies uninformative words in a corpus. We show that our framework not only outperforms other stopword heuristics, but also allows for a substantial reduction of document size in applications of topic modelling. Our findings can be readily generalized to other bag-of-words-type approaches beyond language, such as the statistical analysis of transcriptomics, audio or image corpora.

To better extract meaning from natural language, some less informative words can be removed before a model is trained, which is usually done using manually curated lists of stopwords. A new information theoretic approach can identify uninformative words automatically and more accurately.
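To illustrate the kind of statistic such a framework builds on, the sketch below scores each word by the normalized Shannon entropy of its occurrences across documents: words spread evenly over the whole corpus carry little topical information and are stopword candidates, while words concentrated in few documents are informative. This is a minimal illustration only; the paper's actual estimator additionally compares each word's entropy against a randomized null model, which this sketch omits.

```python
import math
from collections import defaultdict, Counter

def entropy_scores(docs):
    """Normalized entropy of each word's distribution over documents.

    Returns a dict mapping word -> score in [0, 1]. A score near 1
    means the word is spread evenly across documents (uninformative);
    a score of 0 means it occurs in a single document (informative).
    """
    # counts[w] maps document index -> occurrences of w in that document
    counts = defaultdict(Counter)
    for d, doc in enumerate(docs):
        for w in doc.lower().split():
            counts[w][d] += 1

    n_docs = len(docs)
    scores = {}
    for w, per_doc in counts.items():
        total = sum(per_doc.values())
        # Shannon entropy of the word's distribution over documents
        h = -sum((c / total) * math.log2(c / total)
                 for c in per_doc.values())
        # normalize by the maximum attainable entropy, log2(n_docs)
        scores[w] = h / math.log2(n_docs) if n_docs > 1 else 0.0
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
]
scores = entropy_scores(docs)
# "the" occurs in every document -> normalized entropy near 1
# "market" occurs in a single document -> entropy 0
```

In practice one would rank all words by this score and prune those above a threshold, which yields a corpus- and language-specific stopword list instead of a manually curated one.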
