Software Framework for Topic Modelling with Large Corpora

Large corpora are ubiquitous in today's world, and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). We identify a gap in existing VSM implementations: their scalability and ease of use. We describe a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora document after document, in a memory-independent fashion. In this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods, and their application by interested practitioners, are effortless. We demonstrate the usefulness of our approach on a real-world scenario: computing document similarities within the existing digital library DML-CZ.
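The document-streaming idea above can be sketched in a few lines of plain Python: the corpus is exposed as a generator that yields one bag-of-words vector at a time, so memory use stays constant regardless of corpus size. This is an illustrative sketch only; the function names (`tokenize`, `stream_corpus`) are hypothetical and not the framework's actual API.

```python
def tokenize(text):
    """Simplified whitespace tokenization, for illustration only."""
    return text.lower().split()

def stream_corpus(documents):
    """Yield one sparse bag-of-words vector per document.

    Each vector is a sorted list of (token_id, count) pairs. Only the
    vocabulary mapping and the current document are ever held in memory,
    so processing is independent of the total corpus size -- the
    streaming property the abstract describes.
    """
    vocab = {}  # token -> integer id, built incrementally
    for doc in documents:
        bow = {}
        for token in tokenize(doc):
            token_id = vocab.setdefault(token, len(vocab))
            bow[token_id] = bow.get(token_id, 0) + 1
        yield sorted(bow.items())

# Usage: `documents` can be any iterable, e.g. lines read lazily from disk,
# so the full corpus never needs to fit in memory.
for vector in stream_corpus(["human machine interface", "machine learning"]):
    print(vector)
```

Because the corpus is an iterable rather than an in-memory matrix, downstream algorithms such as LSA or LDA can consume it in a single pass, or in repeated passes over a re-openable stream.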
