On the use of linear programming for unsupervised text classification

We propose a new algorithm for dimensionality reduction and unsupervised text classification. We model the corpus as generated by a mixture model and use a novel L1-norm based approach introduced by Kleinberg and Sandler [19]. We show that our algorithm performs extremely well on large datasets, with peak accuracy approaching that of supervised learning based on Support Vector Machines (SVMs) with large training sets. The method is based on the same idea that underlies Latent Semantic Indexing (LSI): we find a good low-dimensional subspace of the feature space and project all documents into it. However, our projection minimizes a different error, and unlike LSI we build a basis that in many cases corresponds to the actual topics. We present results of testing our algorithm on the abstracts of arXiv, an electronic repository of scientific papers, and on the 20 Newsgroups dataset, a small snapshot of 20 specific newsgroups.
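To make the idea of an L1-norm projection concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes only two topics, each given as a word distribution, and finds the mixture weight that minimizes the L1 distance between a document's word distribution and the mixture. With a single free weight the objective is convex and piecewise linear, so it is enough to check the endpoints and the breakpoints; with more topics the same problem would typically be posed as a linear program. The function names `l1_distance` and `l1_project_two_topics` are hypothetical and chosen for illustration only.

```python
def l1_distance(p, q):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l1_project_two_topics(doc, topic_a, topic_b):
    """Return (t, error) minimizing ||doc - (t*topic_a + (1-t)*topic_b)||_1 for t in [0, 1].

    Illustrative assumption: exactly two topics, so the search is one-dimensional.
    """
    # f(t) = sum_i |doc_i - topic_b_i - t*(topic_a_i - topic_b_i)| is convex and
    # piecewise linear, so a minimum is attained at an endpoint or at a breakpoint
    # where one of the absolute-value terms becomes zero.
    candidates = {0.0, 1.0}
    for d, a, b in zip(doc, topic_a, topic_b):
        if a != b:
            t = (d - b) / (a - b)
            if 0.0 < t < 1.0:
                candidates.add(t)

    def error(t):
        mixture = [t * a + (1.0 - t) * b for a, b in zip(topic_a, topic_b)]
        return l1_distance(doc, mixture)

    best = min(candidates, key=error)
    return best, error(best)
```

Under these assumptions, a document that is an exact mixture of the two topics is recovered with zero error, and the returned weight can be read as the document's coordinate in the two-topic basis.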

[1]  Chaitanya Swamy,et al.  Correlation Clustering: maximizing agreements via semidefinite programming , 2004, SODA '04.

[2]  Gene H. Golub,et al.  Matrix Computations, 3rd Edition , 2007.

[3]  Santosh S. Vempala,et al.  Latent semantic indexing: a probabilistic analysis , 1998, PODS '98.

[4]  Anna R. Karlin,et al.  Spectral analysis of data , 2001, STOC '01.

[5]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res.

[6]  Alan M. Frieze,et al.  High Degree Vertices and Eigenvalues in the Preferential Attachment Graph , 2005, Internet Math.

[8]  Chris H. Q. Ding,et al.  Spectral Relaxation for K-means Clustering , 2001, NIPS.

[9]  Andrew McCallum,et al.  Distributional clustering of words for text classification , 1998, SIGIR '98.

[10]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[11]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[12]  Christos H. Papadimitriou,et al.  On the Eigenvalue Power Law , 2002, RANDOM.

[13]  Chris Ding,et al.  On the Use of Singular Value Decomposition for Text Retrieval , 2000 .

[14]  Nicole Immorlica,et al.  Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques , 2003, Lecture Notes in Computer Science.

[15]  Fan Chung Graham,et al.  The Spectra of Random Graphs with Given Expected Degrees , 2004, Internet Math.

[16]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[17]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[18]  Jon M. Kleinberg,et al.  Using mixture models for collaborative filtering , 2004, STOC '04.

[19]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[20]  A. Ng.  Feature selection, L1 vs. L2 regularization, and rotational invariance , 2004, Twenty-first international conference on Machine learning - ICML '04.

[21]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[22]  Eric Saund,et al.  Applying the Multiple Cause Mixture Model to Text Categorization , 1996, ICML.

[23]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[24]  T. Kanade,et al.  Robust subspace computation using L1 norm , 2003 .

[25]  Anirban Dasgupta,et al.  Spectral analysis of random graphs with skewed degree distributions , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[26]  Eric Saund,et al.  A Multiple Cause Mixture Model for Unsupervised Learning , 1995, Neural Computation.

[27]  Inderjit S. Dhillon,et al.  Enhanced word clustering for hierarchical text classification , 2002, KDD.

[28]  Geoffrey J. McLachlan,et al.  Mixture models : inference and applications to clustering , 1989 .