Concept Indexing: A Fast Dimensionality Reduction Algorithm With Applications to Document Retrieval and Categorization

Abstract: In recent years, we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can efficiently categorize and retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limit its applicability. In this paper we present a fast dimensionality reduction algorithm, called concept indexing (CI), that is equally effective for unsupervised and supervised dimensionality reduction. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. Experimental results show that the dimensionality reduction computed by CI achieves retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time. Moreover, when CI is used to compute the dimensionality reduction in a supervised setting, it greatly improves the performance of traditional classification algorithms such as C4.5 and kNN.
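The two steps named in the abstract (group the documents, then use the group centroids as axes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which clustering algorithm is used or how documents are mapped onto the axes, so the sketch assumes a simple cosine-based k-means as a stand-in and a dot-product projection onto unit-length centroids. All function names (`cluster_documents`, `concept_axes`, `project`) and the toy term vectors are hypothetical.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_documents(docs, k, iterations=20, seed=0):
    """Assumed stand-in clustering: k-means with cosine similarity."""
    rng = random.Random(seed)
    centers = [normalize(d) for d in rng.sample(docs, k)]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to its most similar center.
        labels = [max(range(k), key=lambda j: dot(d, centers[j])) for d in docs]
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # keep the previous center if a cluster empties
                centers[j] = normalize(centroid(members))
    return labels

def concept_axes(docs, labels):
    """One unit-length centroid per distinct label; these act as the axes."""
    axes = []
    for lab in sorted(set(labels)):
        members = [d for d, l in zip(docs, labels) if l == lab]
        axes.append(normalize(centroid(members)))
    return axes

def project(doc, axes):
    """Assumed projection: similarity of the document to each axis."""
    return [dot(doc, axis) for axis in axes]

# Toy collection: four documents over four terms, two obvious topics.
docs = [normalize(v) for v in ([2, 1, 0, 0], [1, 2, 0, 0],
                               [0, 0, 1, 2], [0, 0, 2, 1])]

# Unsupervised: groups come from clustering.
labels = cluster_documents(docs, k=2)
axes = concept_axes(docs, labels)
reduced = [project(d, axes) for d in docs]

# Supervised: groups come from known class labels instead.
class_labels = ["a", "a", "b", "b"]
class_axes = concept_axes(docs, class_labels)
class_reduced = [project(d, class_axes) for d in docs]
```

In this sketch the supervised and unsupervised variants differ only in where the group labels come from, which mirrors the abstract's claim that the same procedure serves both settings. A production version would use sparse vectors and a scalable clustering method.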
