Associative Clustering by Maximizing a Bayes Factor

Abstract

Clustering by maximizing the dependency between (margin) groupings or partitionings of co-occurring data pairs is studied. We suggest a probabilistic criterion that generalizes discriminative clustering (DC), an extension of the information bottleneck (IB) principle to labeled continuous data. The criterion is the Bayes factor between models assuming dependence and independence of the two cluster sets, and it can be used as a well-founded criterion for IB for small data sets. With suitable prior assumptions the Bayes factor is equivalent to the hypergeometric probability of a contingency table with the optimized clusters at the margins, and for large data it becomes the standard mutual information. An algorithm for two-margin clustering of paired continuous data, associative clustering (AC), is introduced. Genes are clustered to find dependencies between gene expression and transcription factor binding, and dependencies between expression in different organisms.

1 Introduction

Distributional clustering by the information bottleneck (IB) principle [20] groups nominal values x of a random variable X by maximizing the dependency of the groups with another, co-occurring discrete variable Y. Clustering documents x by the occurrences of words y in them is an example. For a continuous X, the analogue of IB is to partition the space of possible values x ∈ R
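The relationship stated in the abstract between the Bayes factor, the hypergeometric probability of a contingency table, and mutual information can be checked numerically. The following is a minimal sketch, not the implementation used in the paper. It assumes symmetric Dirichlet priors with a single shared pseudo-count `alpha` (set to 1 by default) for the dependent model and for both margins of the independent model; the exact priors used in the paper may differ. All function names are hypothetical and chosen for illustration only.

```python
from math import lgamma, log, exp


def margins(table):
    """Return row sums, column sums, and the grand total of a count table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)


def log_hypergeom(table):
    """Log of the multivariate hypergeometric probability of the table
    given its margins: prod(n_i.!) prod(n_.j!) / (N! prod(n_ij!))."""
    rows, cols, n = margins(table)
    num = sum(lgamma(m + 1) for m in rows + cols)
    den = lgamma(n + 1) + sum(lgamma(c + 1) for r in table for c in r)
    return num - den


def log_bayes_factor(table, alpha=1.0):
    """Log Bayes factor, dependent vs. independent margins.
    ASSUMPTION: symmetric Dirichlet(alpha) priors on the cell probabilities
    (dependent model) and on each margin (independent model)."""
    rows, cols, n = margins(table)

    def log_evidence(counts):
        # Dirichlet-multinomial evidence of an ordered sample with these counts.
        m = len(counts)
        return (lgamma(m * alpha) - lgamma(n + m * alpha)
                + sum(lgamma(c + alpha) - lgamma(alpha) for c in counts))

    cells = [c for r in table for c in r]
    return log_evidence(cells) - log_evidence(rows) - log_evidence(cols)


def mutual_information(table):
    """Empirical mutual information (in nats) of the normalized table."""
    rows, cols, n = margins(table)
    mi = 0.0
    for i, r in enumerate(table):
        for j, c in enumerate(r):
            if c > 0:
                mi += (c / n) * log(c * n / (rows[i] * cols[j]))
    return mi


if __name__ == "__main__":
    small = [[3, 1], [1, 3]]
    big = [[30000, 10000], [10000, 30000]]  # same proportions, more data
    print("hypergeometric probability:", exp(log_hypergeom(small)))
    print("log Bayes factor (alpha=1):", log_bayes_factor(small))
    print("-log P_hyper / N:", -log_hypergeom(big) / 80000)
    print("mutual information:", mutual_information(big))
```

Under these assumptions, with `alpha = 1` the log Bayes factor and the negative log hypergeometric probability differ only by a term that depends on the sample size and the table dimensions, so for a fixed number of clusters and data points, maximizing one is equivalent to minimizing the other. For large counts, the negative log hypergeometric probability divided by the sample size approaches the empirical mutual information.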

[1]  R. Tibshirani, et al.  Discriminant Analysis by Gaussian Mixtures, 1996.

[2]  Naftali Tishby, et al.  Unsupervised document classification using sequential information maximization, 2002, SIGIR '02.

[3]  Donna R. Maglott, et al.  RefSeq and LocusLink: NCBI gene-centered resources, 2001, Nucleic Acids Res.

[4]  Zohar Yakhini, et al.  Clustering gene expression patterns, 1999, J. Comput. Biol.

[5]  Noam Slonim, et al.  Maximum Likelihood and the Information Bottleneck, 2002, NIPS.

[6]  I. Good  On the Application of Symmetric Dirichlet Distributions and their Mixtures to Contingency Tables, 1976.

[7]  D. Botstein, et al.  Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms, 2003, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Nicola J. Rinaldi, et al.  Transcriptional Regulatory Networks in Saccharomyces cerevisiae, 2002, Science.

[9]  Michael I. Jordan, et al.  Distance Metric Learning with Application to Clustering with Side-Information, 2002, NIPS.

[10]  Jim Kay, et al.  Feature discovery under contextual supervision using mutual information, 1992, Proceedings 1992 IJCNN International Joint Conference on Neural Networks.

[11]  Peter G. Schultz, et al.  Large-scale analysis of the human and mouse, 2002.

[12]  A. Orth, et al.  Large-scale analysis of the human and mouse transcriptomes, 2002, Proceedings of the National Academy of Sciences of the United States of America.

[13]  D. Botstein, et al.  Cluster analysis and display of genome-wide expression patterns, 1998, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Yudong D. He, et al.  Functional Discovery via a Compendium of Expression Profiles, 2000, Cell.

[15]  Wray L. Buntine  Variational Extensions to EM and Multinomial PCA, 2002, ECML.

[16]  Michael I. Jordan, et al.  Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[17]  Tommi S. Jaakkola, et al.  Kernel Expansions with Unlabeled Examples, 2000, NIPS.

[18]  David J. Miller, et al.  A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data, 1996, NIPS.

[19]  Thomas Hofmann, et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis, 2004, Machine Learning.

[20]  Gal Chechik, et al.  Extracting Relevant Structures with Side Information, 2002, NIPS.

[21]  Samuel Kaski, et al.  Discriminative Clustering: Optimal Contingency Tables by Learning Metrics, 2002, ECML.

[22]  Samuel Kaski, et al.  Clustering Based on Conditional Distributions in an Auxiliary Space, 2002, Neural Computation.

[23]  Ben Taskar, et al.  Rich probabilistic models for gene expression, 2001, ISMB.

[24]  Samuel Kaski, et al.  Regularized discriminative clustering, 2003, 2003 IEEE XIII Workshop on Neural Networks for Signal Processing.