Associative clustering for exploring dependencies between functional genomics data sets

High-throughput genomic measurements, interpreted as co-occurring data samples from multiple sources, open up a fresh problem for machine learning: what do the different data sets have in common, that is, what kinds of statistical dependencies exist between the paired samples from the different sets? We introduce a clustering algorithm for exploring these dependencies. Samples within each data set are grouped such that the dependencies between groups of different sets capture as much of the pairwise dependencies between the samples as possible. We formalize this problem in a novel probabilistic way, as optimization of a Bayes factor. The method is applied to reveal commonalities and exceptions in gene expression between organisms and to suggest regulatory interactions in the form of dependencies between gene expression profiles and regulator binding patterns.
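
To make the Bayes factor objective concrete, the following is a minimal sketch and not the authors' implementation. It assumes that cluster assignments for the paired samples are already given, cross-tabulates them into a contingency table, and compares a dependent model (one multinomial over all cells) against an independent model (a product of the row and column margins), both with symmetric Dirichlet priors. The function names and the prior parameters alpha_cells, alpha_rows, and alpha_cols are hypothetical choices for illustration; the abstract does not specify them.

```python
from math import lgamma


def contingency_table(labels_x, labels_y, n_clusters_x, n_clusters_y):
    """Count paired samples falling into each (cluster_x, cluster_y) cell."""
    table = [[0] * n_clusters_y for _ in range(n_clusters_x)]
    for i, j in zip(labels_x, labels_y):
        table[i][j] += 1
    return table


def _log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of a count sequence under a multinomial
    with a symmetric Dirichlet(alpha) prior on its probabilities."""
    total = sum(counts)
    k = len(counts)
    return (lgamma(k * alpha) - lgamma(total + k * alpha)
            + sum(lgamma(c + alpha) - lgamma(alpha) for c in counts))


def log_bayes_factor(table, alpha_cells=1.0, alpha_rows=1.0, alpha_cols=1.0):
    """Log Bayes factor: dependent cells versus independent margins.

    The prior parameters are illustrative assumptions, not values taken
    from the paper.
    """
    cells = [c for row in table for c in row]
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    log_dependent = _log_dirichlet_multinomial(cells, alpha_cells)
    log_independent = (_log_dirichlet_multinomial(row_sums, alpha_rows)
                       + _log_dirichlet_multinomial(col_sums, alpha_cols))
    return log_dependent - log_independent


# Toy usage: cluster labels of the same 100 paired samples in two data sets.
labels_x = [0] * 50 + [1] * 50
labels_y_matched = [0] * 50 + [1] * 50      # clusters agree across sets
labels_y_unrelated = [0, 1] * 50            # clusters unrelated across sets

dependent_table = contingency_table(labels_x, labels_y_matched, 2, 2)
unrelated_table = contingency_table(labels_x, labels_y_unrelated, 2, 2)
print(log_bayes_factor(dependent_table))
print(log_bayes_factor(unrelated_table))
```

In this toy setting the matched clustering should score a clearly higher log Bayes factor than the unrelated one. A clustering algorithm of the kind described above would search over the cluster assignments in each data set to increase such a score; that search is not shown here.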
