Clustering of multi-domain information networks

Clustering is one of the most basic mental activities humans use to handle the large amount of information they receive every day. As such, clustering has been studied extensively in many disciplines, including statistics, pattern recognition, machine learning, and data mining. Nevertheless, the body of knowledge concerning clustering has focused on objects represented as feature vectors stored in a single dataset. Clustering in this setting aims to group objects of a single type, held in a single table, into clusters using their feature vectors. Modern real-world applications, on the other hand, are composed of multiple large, interrelated datasets comprising distinct attribute sets and containing objects from many domains; typically such data is stored in an information network. The types of patterns and knowledge desired in these applications go far beyond grouping similar homogeneous objects; they involve unveiling dependency structures in the data and pinpointing hidden associations across objects in multiple datasets and domains. For example, consider an information network that contains the domains of authors, papers, and conferences. Two authors a1 and a2 may work in the same research field but never publish in the same conference. Hence, clustering only the domains of authors and conferences would fail to place a1 and a2 in the same cluster; considering the entire information network, however, would reveal a hidden link via the papers domain, placing a1 and a2 in the same cluster. Notice that the knowledge discovered in the preceding example was derived by clustering objects based on their relations to other objects, as opposed to grouping objects based on their attributes.
This form of relational clustering is essential for knowledge discovery in several applications: in bioinformatics, exploring the clusters among the domains of genes, diseases, drugs, and patients; in social networking, segmenting customers based on friendship relations, social groups, and demographics; and in recommender systems, leveraging user ratings, product ratings, product functionality, and blog entries to cluster customers and products simultaneously. Information-network clustering advances knowledge discovery in two ways. First, hidden associations among objects from different domains are unveiled, leading to a better understanding of the hidden structure of the entire network. Second, local clusters of the objects within a domain are sharpened and put into greater context, leading to more accurate local clustering. To extract knowledge of this form, information-network clustering algorithms must consider (1) overlapping clusters and (2) a clustering structure that relates the clusters. In this dissertation we develop a framework and several algorithms for information-network clustering that leverage these two fundamental aspects of knowledge discovery. Current state-of-the-art information-network clustering algorithms have had relative success in addressing the computational challenge of high-dimensional data; however, the majority of these approaches have not addressed the fundamental aspects of overlapping clusters and clustering structure. We address the information-network clustering problem from a fresh perspective and introduce a novel framework based on Formal Concept Analysis (FCA). Grounded in mathematical order theory, in particular the theory of complete lattices, FCA provides a rich theoretical basis for investigating and structuring overlapping relational clusters in a single dataset.
We overcome the shortcomings of previous methods by extending FCA to information networks, yielding effective and efficient information-network clustering algorithms. Several empirical evaluations performed on a large variety of real-world information networks reveal that the FCA-based algorithms work more effectively and efficiently than the current state of the art. Additionally, the dissertation addresses two drawbacks of FCA in single-edge information networks (bi-clustering). The first drawback is that the set of bi-clusters tends to be very large, which makes reasoning about the bi-clusters difficult. We address this problem by introducing the idea of significant distinguishing sets, together with an algorithm that enumerates these sets efficiently. The second drawback of the FCA framework is its strict definition of a bi-cluster. FCA specifies that a bi-cluster is a maximal sub-matrix of 1s in the dataset; however, in many knowledge discovery tasks the bi-clusters of interest are those subspaces of objects and attributes that exhibit a banded structure. We show, theoretically, the correspondence between bandedness and FCA-based clustering. Moreover, we present an algorithm that uncovers banded structures via an intelligent search of the bi-cluster lattice.
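
To make the FCA notion of a bi-cluster as a maximal sub-matrix of 1s concrete, the following is a minimal sketch that enumerates the formal concepts of a tiny single-edge network. It is an illustration only, not one of the dissertation's algorithms: the function name formal_concepts, the toy author-conference data, and the brute-force strategy are all assumptions made for this example.

```python
from itertools import combinations


def formal_concepts(context):
    """Return all (extent, intent) pairs of a binary object-attribute context.

    context: dict mapping each object to the set of attributes it has.
    Hypothetical helper for illustration; brute force, tiny inputs only.
    """
    objects = sorted(context)
    attributes = sorted(set().union(*context.values()))

    def common_attributes(objs):
        # Attributes shared by every object in objs (all attributes if objs is empty).
        shared = set(attributes)
        for obj in objs:
            shared &= context[obj]
        return frozenset(shared)

    def common_objects(attrs):
        # Objects that possess every attribute in attrs.
        return frozenset(obj for obj in objects if attrs <= context[obj])

    concepts = set()
    # Close every subset of objects; each closure is a maximal all-1s sub-matrix.
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset)
            extent = common_objects(intent)
            concepts.add((extent, intent))
    return concepts


# Toy author-conference relation (made-up data, not from the dissertation).
example = {
    "a1": {"KDD", "ICML"},
    "a2": {"KDD", "ICML", "SDM"},
    "a3": {"SDM"},
}

concepts = formal_concepts(example)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

In this toy relation the author a2 belongs to several bi-clusters at once, and the bi-clusters are ordered by inclusion of their author sets, which illustrates the two aspects emphasized above: overlapping clusters and a structure relating them. The exhaustive enumeration is exponential in the number of objects and is meant only to show the definition.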
