A segment-based approach to clustering multi-topic documents

Document clustering is a central problem in text data management. It becomes particularly challenging when documents contain subtopical discussions that are not necessarily related to one another. Existing document clustering methods have traditionally treated a document as an indivisible unit for text representation and similarity computation, an assumption that may be inappropriate for documents with multiple topics. In this paper, we address multi-topic document clustering by leveraging the natural composition of documents into text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that induces a document organization by identifying cohesive groups of segments drawn from the original documents. We provide empirical evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it with conventional document clustering methods.
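
To make the segment-based idea concrete, the following is a minimal sketch and not the framework proposed in the paper. It assumes three simplifications that the abstract does not specify: segments are obtained by splitting on blank lines (a stand-in for a real text segmenter), segments are represented as raw term-count vectors, and segments are grouped by a greedy single-pass procedure with a cosine threshold. All function names and the `threshold` parameter are hypothetical. Each document is then associated with every cluster that contains at least one of its segments, so a multi-topic document can belong to more than one group.

```python
import math
import re
from collections import Counter, defaultdict


def split_segments(doc):
    """Split a document on blank lines (assumed stand-in for a real segmenter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]


def vectorize(text):
    """Represent a segment as a bag of lowercase alphabetic terms."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_segments(vectors, threshold=0.3):
    """Greedy single-pass clustering (an assumption, chosen for brevity).

    Each segment joins the most similar existing cluster if the cosine
    similarity to its summed term vector reaches `threshold`; otherwise
    it opens a new cluster.
    """
    centroids, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best].update(vec)  # Counter.update adds the counts
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


def cluster_documents(docs, threshold=0.3):
    """Cluster segments, then map each document to its segments' clusters."""
    owners, vectors = [], []
    for doc_id, doc in enumerate(docs):
        for segment in split_segments(doc):
            owners.append(doc_id)
            vectors.append(vectorize(segment))
    labels = cluster_segments(vectors, threshold)
    doc_clusters = defaultdict(set)
    for doc_id, label in zip(owners, labels):
        doc_clusters[doc_id].add(label)
    return {doc_id: sorted(c) for doc_id, c in doc_clusters.items()}


# Toy collection: document 0 discusses two unrelated subtopics.
docs = [
    "football match goal team\n\nstock market shares price",
    "team goal football league",
    "market price shares trading",
]
result = cluster_documents(docs)
print(result)  # {0: [0, 1], 1: [0], 2: [1]}
```

In this toy run, the two-topic document is linked to both segment clusters, whereas treating it as a single vector would force it toward one group or blur the two topics together.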
