Fully automatic cross-associations

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, and co-citations, as well as information retrieval, collaborative filtering, and sparse matrix reordering. Virtually all popular methods for the analysis of such matrices (e.g., k-means clustering, METIS graph partitioning, SVD/PCA, and frequent itemset mining) require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem.

Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimum. Our method is parameter-free and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.
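To make the idea of homogeneous blocks and an information-theoretic criterion concrete, the following is a minimal sketch, not the paper's actual algorithm or exact cost function. It assumes that the data cost of a grouping is the number of bits needed to encode each rectangular block at its own density of ones (block size times binary entropy). The names `binary_entropy` and `block_code_length` are hypothetical, and the sketch deliberately omits the cost of describing the groups themselves, which a complete criterion would also have to charge for.

```python
import math
from collections import defaultdict


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 for a pure block."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def block_code_length(matrix, row_groups, col_groups):
    """Illustrative data cost (in bits) of a binary matrix under a grouping.

    matrix     -- list of rows, each a list of 0/1 values
    row_groups -- row_groups[i] is the group label of row i
    col_groups -- col_groups[j] is the group label of column j

    Assumption: each (row group, column group) block is encoded
    independently at its own density of ones. The cost of describing
    the grouping itself is NOT included here.
    """
    ones = defaultdict(int)   # number of 1s per block
    size = defaultdict(int)   # number of cells per block
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_groups[i], col_groups[j])
            size[key] += 1
            ones[key] += value
    return sum(n * binary_entropy(ones[key] / n) for key, n in size.items())


# A toy 4x4 matrix with two clean diagonal blocks.
matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# One group for all rows and columns: the single block is half ones.
single = block_code_length(matrix, [0, 0, 0, 0], [0, 0, 0, 0])

# Two row groups and two column groups: every block is homogeneous.
split = block_code_length(matrix, [0, 0, 1, 1], [0, 0, 1, 1])

print(single, split)
```

Under these assumptions the two-by-two grouping costs fewer bits than the single group because each of its blocks is all ones or all zeros. Comparing total description lengths across candidate groupings is the kind of comparison that lets the number of row and column groups be chosen without user-supplied parameters.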
