Iterative Hard Thresholding for Keyword Extraction from Large Text Corpora

To understand and analyze a text corpus, such as a collection of news articles, it is often useful to extract keywords that are meaningfully associated with a given topic. A corpus of documents labeled by topic lets us treat this as a learning problem. We approach it through statistical text analysis, using bag-of-words frequencies as features for a sparse linear model. We demonstrate through numerical experiments that iterative hard thresholding (IHT) is a practical and effective algorithm for keyword extraction from large text corpora. Our implementation of IHT can quickly analyze more than 800,000 documents, returning keywords comparable to those found by algorithms that solve a Lasso formulation of the problem, in significantly less computation time. Further, we generalize the analysis of the IHT algorithm to show that it is stable for rank-deficient matrices, which often arise from our bag-of-words model.
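As a rough illustration of the kind of iteration IHT performs, the sketch below implements the textbook update: a gradient step on the least-squares residual followed by keeping only the k largest-magnitude coefficients. It is not the implementation described above. The names `hard_threshold` and `iht`, the fixed step size of one over the squared spectral norm, the fixed iteration count, and the dense NumPy matrix `A` (rows as documents, columns as word frequencies) are all assumptions made for the example.

```python
import numpy as np


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    out = np.zeros_like(x)
    k = min(k, x.size)
    if k > 0:
        # Indices of the k entries with the largest absolute value.
        keep = np.argpartition(np.abs(x), -k)[-k:]
        out[keep] = x[keep]
    return out


def iht(A, y, k, n_iter=200):
    """Textbook IHT sketch for min ||y - A x||_2 subject to x having at most k nonzeros.

    A is assumed to be a dense, nonzero 2-D array (documents x words) and
    y a 1-D array of labels or responses. Both are hypothetical inputs.
    """
    # Assumed step size: 1 / ||A||_2^2, which keeps the gradient step non-expansive.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        gradient_step = x + step * (A.T @ (y - A @ x))
        x = hard_threshold(gradient_step, k)
    return x
```

In a keyword-extraction setting, the indices of the nonzero entries of the returned vector would identify the selected words. A corpus of the size mentioned above would call for a sparse matrix representation, which this dense sketch does not attempt.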
