Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

Clusters of text documents produced by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that require clusters to be both reconfigurable and highly interpretable, and we outline the problem of generating clusterings with interpretable and reconfigurable cluster models. Toward this goal, we develop two clustering algorithms, RGC and RGC-D, that generate clusters with associated rules composed of conditions on word occurrences or nonoccurrences. The two approaches differ in the complexity of the rule format: RGC employs both disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. RGC-D rules are easy to interpret, whereas RGC rules lead to more accurate clustering. We show that our approaches outperform, by significant margins, both the unsupervised decision tree approach for rule-generating clustering and an approach we provide for generating interpretable models for general clusterings. We empirically show that, using the algorithms presented herein, the losses in purity and F-measure incurred to achieve interpretability can be as little as 3% and 5%, respectively.
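
The rule formats described above can be illustrated with a minimal Python sketch. It shows only how a word-based rule defines a cluster as exactly the set of documents satisfying it; it does not implement RGC or RGC-D, whose rule-generation procedures are not given in the abstract. The helper names (`present`, `absent`, `any_of`, `all_of`, `cluster`) and the toy documents are assumptions made for illustration.

```python
from typing import Callable, List, Set

# Assumption: each document is reduced to its set of distinct words.
Doc = Set[str]
Rule = Callable[[Doc], bool]


def present(word: str) -> Rule:
    """Condition that holds when the word occurs in the document."""
    return lambda doc: word in doc


def absent(word: str) -> Rule:
    """Condition that holds when the word does not occur in the document."""
    return lambda doc: word not in doc


def any_of(*rules: Rule) -> Rule:
    """Disjunction of conditions."""
    return lambda doc: any(rule(doc) for rule in rules)


def all_of(*rules: Rule) -> Rule:
    """Conjunction of conditions."""
    return lambda doc: all(rule(doc) for rule in rules)


def cluster(rule: Rule, docs: List[Doc]) -> List[int]:
    """Indices of exactly those documents that satisfy the rule."""
    return [i for i, doc in enumerate(docs) if rule(doc)]


# Hypothetical toy corpus.
docs: List[Doc] = [
    {"stock", "market", "shares"},
    {"match", "goal", "league"},
    {"market", "league", "transfer"},
    {"election", "vote"},
]

# Simple disjunction of word-presence conditions (the format attributed to RGC-D).
simple_rule = any_of(present("stock"), present("market"))

# A rule mixing disjunction, conjunction, and a nonoccurrence condition
# (one possible shape for the richer format attributed to RGC).
richer_rule = all_of(any_of(present("goal"), present("league")), absent("market"))

print(cluster(simple_rule, docs))  # [0, 2]
print(cluster(richer_rule, docs))  # [1]
```

Because a cluster here is nothing more than the set of documents matching its rule, editing the rule changes the cluster directly, which is one plausible reading of what a reconfigurable cluster model allows.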
