CDIM: Document Clustering by Discrimination Information Maximization

Ideally, document clustering methods should produce clusters that are semantically relevant and readily understandable as collections of documents belonging to particular contexts or topics. However, popular document clustering methods often ignore the semantics available in the term-document corpus and rely instead on generic measures of similarity. In this paper, we present CDIM, an algorithmic framework for partitional clustering of documents that maximizes the sum of the discrimination information provided by documents. CDIM exploits the observation that term discrimination information captures contextual topics better than term-to-term relatedness, and it therefore yields clusters that can be described by their highly discriminating terms. We evaluate the proposed clustering algorithm with well-known discrimination and semantic measures, including Relative Risk (RR), Measurement of Discrimination Information (MDI), Domain Relevance (DR), and Domain Consensus (DC), on twelve data sets, and show that CDIM produces high-quality clusters comparable to those of the best methods. We also illustrate the understandability and efficiency of CDIM, suggesting its suitability for practical document clustering.
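
To make the notion of term discrimination information concrete, the following is a minimal sketch of one of the measures named above, relative risk, computed for every term of a given cluster. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `relative_risk`, the estimation of probabilities from document frequencies, and the additive smoothing constant are all choices made here for the example.

```python
from collections import Counter


def relative_risk(docs, labels, cluster, smoothing=1.0):
    """Relative risk of each term for `cluster` versus all other clusters.

    RR(t) = P(t | cluster) / P(t | rest). Both probabilities are estimated
    as the smoothed fraction of documents containing the term (an assumption
    of this sketch, not necessarily the estimator used in the paper).
    """
    # Treat each document as a set of terms: presence, not frequency.
    in_docs = [set(d) for d, l in zip(docs, labels) if l == cluster]
    out_docs = [set(d) for d, l in zip(docs, labels) if l != cluster]

    in_counts = Counter(t for d in in_docs for t in d)
    out_counts = Counter(t for d in out_docs for t in d)

    rr = {}
    for term in set(in_counts) | set(out_counts):
        # Additive smoothing keeps the ratio finite for unseen terms.
        p_in = (in_counts[term] + smoothing) / (len(in_docs) + 2 * smoothing)
        p_out = (out_counts[term] + smoothing) / (len(out_docs) + 2 * smoothing)
        rr[term] = p_in / p_out
    return rr


# Toy corpus: two "sports" documents (cluster 0), two "finance" documents (cluster 1).
docs = [["ball", "goal"], ["ball", "team"], ["stock", "market"], ["stock", "bank"]]
labels = [0, 0, 1, 1]
scores = relative_risk(docs, labels, cluster=0)
print(sorted(scores, key=scores.get, reverse=True)[:1])  # most discriminating term
```

Terms with a relative risk well above 1 are the ones that would serve as a cluster's descriptive, highly discriminating terms; terms with a value below 1 point toward the other clusters.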
