Clustering and Classification in Text Collections Using Graph Modularity

A new fast algorithm for clustering and classification of large collections of text documents is introduced. The new algorithm employs the bipartite graph that realizes the word-document matrix of the collection. Namely, the modularity of the bipartite graph is used as the optimization functional. Experiments performed with the new algorithm on a number of text collections had shown a competitive quality of the clustering (classification), and a record-breaking speed.

[1]  Massih-Reza Amini,et al.  An extension of PLSA for document clustering , 2008, CIKM '08.

[2]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[3]  U. Brandes,et al.  Maximizing Modularity is hard , 2006, physics/0608255.

[4]  Ana Margarida de Jesus,et al.  Improving Methods for Single-label Text Categorization , 2007 .

[5]  Mark E. J. Newman,et al.  Stochastic blockmodels and community structure in networks , 2010, Physical review. E, Statistical, nonlinear, and soft matter physics.

[6]  Renaud Lambiotte,et al.  Multi-scale modularity in complex networks , 2010, 8th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks.

[7]  P. Bickel,et al.  A nonparametric view of network models and Newman–Girvan and other modularities , 2009, Proceedings of the National Academy of Sciences.

[8]  Joydeep Ghosh,et al.  Under Consideration for Publication in Knowledge and Information Systems Generative Model-based Document Clustering: a Comparative Study , 2003 .

[9]  S. Fortunato,et al.  Resolution limit in community detection , 2006, Proceedings of the National Academy of Sciences.

[10]  Fei Wang,et al.  Regularized clustering for documents , 2007, SIGIR.

[11]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[12]  Chris H. Q. Ding,et al.  Bipartite graph partitioning and data clustering , 2001, CIKM '01.

[13]  Yaxin Bi,et al.  An kNN Model-Based Approach and Its Application in Text Categorization , 2004, CICLing.

[14]  J. Reichardt,et al.  Statistical mechanics of community detection. , 2006, Physical review. E, Statistical, nonlinear, and soft matter physics.

[15]  M E J Newman,et al.  Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[17]  Maria P. Grineva,et al.  Extracting key terms from noisy and multitheme documents , 2009, WWW '09.

[18]  Shlomo Geva,et al.  Document Clustering with K-tree , 2008, INEX.

[19]  Michael W. Berry,et al.  Survey of Text Mining , 2003, Springer New York.

[20]  E. Polak Introduction to linear and nonlinear programming , 1973 .

[21]  M. Barber Modularity and community detection in bipartite networks. , 2007, Physical review. E, Statistical, nonlinear, and soft matter physics.

[22]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[23]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.