TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections

A wide range of computational kernels in data mining and information retrieval from text collections rely on techniques from linear algebra. These kernels typically operate on data presented as large sparse term-document matrices (TDMs). We present TMG, a research and teaching toolbox for generating sparse TDMs from text collections and for incrementally modifying these TDMs by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem-solving environment with strong support for computational linear algebra, in order to streamline document preprocessing and the prototyping of algorithms for information retrieval. We address several design issues concerning the use of MATLAB's sparse-matrix infrastructure and data structures. We illustrate the use of the tool in numerical explorations of how stemming and different term-weighting policies affect the performance of querying and clustering tasks.
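To make the notion of a sparse TDM concrete, the following is a minimal sketch in Python rather than MATLAB, and it does not reproduce TMG's interface. The function names (`tokenize`, `build_tdm`, `tfidf`), the tiny stoplist, and the particular tf-idf weighting are assumptions chosen for illustration. Stemming is omitted. The matrix is stored sparsely as a dictionary that maps (term index, document index) pairs to nonzero counts.

```python
import math
import re
from collections import Counter

# Hypothetical stoplist, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_tdm(docs):
    """Return (terms, entries): a sorted vocabulary and a sparse
    {(term_index, doc_index): raw_count} mapping of nonzero entries."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    row = {t: i for i, t in enumerate(terms)}
    entries = {
        (row[t], j): c
        for j, doc_counts in enumerate(counts)
        for t, c in doc_counts.items()
    }
    return terms, entries

def tfidf(entries, n_docs):
    """Apply one possible weighting: raw count times log(n_docs / document frequency)."""
    df = Counter(i for (i, _j) in entries)
    return {(i, j): c * math.log(n_docs / df[i]) for (i, j), c in entries.items()}

docs = [
    "Sparse matrices in information retrieval",
    "Retrieval of documents and clustering of documents",
    "Clustering sparse data",
]
terms, tdm = build_tdm(docs)
weighted = tfidf(tdm, len(docs))
```

Each row corresponds to a term and each column to a document, so only the terms that actually occur in a document consume storage. Swapping in a different weighting function is how one would compare term-weighting policies in this sketch.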
