Relationship-based clustering and cluster ensembles for high-dimensional data mining

This dissertation takes a relationship-based approach to cluster analysis of high-dimensional data (1000 or more dimensions) that side-steps the ‘curse of dimensionality’ by working in a suitable similarity space instead of the original feature space. We propose two frameworks that leverage graph algorithms to achieve relationship-based clustering and visualization, respectively. In the visualization framework, the output of the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. Results on retail transaction, document (bag-of-words), and web-log data show that our approach can yield superior results while also taking additional balance constraints into account. The choice of similarity is a critical step in relationship-based clustering, and this motivates our systematic comparative study of the impact of similarity measures on the quality of document clusters. The key findings of our experimental study are: (i) cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior to k-means and SOMs, especially when balanced clusters are desired; and (iv) performance curves generally do not cross. We also propose a cluster quality evaluation measure based on normalized mutual information and find an analytical relation between similarity measures. It is widely recognized that combining multiple classification or regression models typically provides superior results compared to using a single, well-tuned model. However, there are no well-known approaches to combining multiple clusterings. The idea of combining cluster labelings without accessing the original features leads to a general knowledge reuse framework that we call cluster ensembles. We propose a formal definition of the cluster ensemble as an optimization problem.
Taking a relationship-based approach, we propose three effective and efficient combining algorithms that solve this problem heuristically using a hypergraph model. Results on synthetic as well as real data sets show that cluster ensembles can (i) improve quality and robustness, (ii) enable distributed clustering, and (iii) speed up processing significantly with little loss in quality.
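
As a rough illustration of two ingredients named in the abstract, the sketch below computes cosine and extended Jaccard similarity between two feature vectors, and a normalized mutual information (NMI) score between two cluster labelings. It is not the dissertation's implementation. The function names are hypothetical, the extended Jaccard form used is the standard Tanimoto formula, and normalizing the mutual information by the geometric mean of the two entropies is an assumption, since the abstract does not state which normalization is used.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) similarity for real-valued vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling's cluster sizes."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information between two labelings of the same points,
    normalized by the geometric mean of their entropies (assumed choice)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = math.sqrt(entropy(labels_a) * entropy(labels_b))
    # Convention chosen here: a labeling with a single cluster scores 0.
    return mi / denom if denom > 0 else 0.0


if __name__ == "__main__":
    x, y = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
    print("cosine:", cosine(x, y))
    print("extended Jaccard:", extended_jaccard(x, y))
    print("NMI, relabeled clusters:", nmi([0, 0, 1, 1], [1, 1, 0, 0]))
    print("NMI, independent clusters:", nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

In the toy vectors above, `y` is `x` scaled by two, so the cosine is 1 (up to rounding) while the extended Jaccard value drops to 2/3: cosine ignores vector length and extended Jaccard does not. The NMI score depends only on how the points are grouped, not on the label names, which is what makes it usable for comparing clusterings without access to the original features.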
