A Survey of Text Clustering Algorithms

Clustering is a widely studied data mining problem in the text domain. It finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. This chapter provides a detailed survey of text clustering. We study the key challenges of clustering as it applies to text, discuss the key methods used for text clustering and their relative advantages, and review a number of recent advances in the context of social networks and linked data.
