Associative Clustering for Exploring Dependencies between Functional Genomics Data Sets

High-throughput genomic measurements, interpreted as co-occurring data samples from multiple sources, open up a fresh problem for machine learning: what do the different data sets have in common, that is, what kinds of statistical dependencies exist between the paired samples from the different sets? We introduce a clustering algorithm for exploring these dependencies. Samples within each data set are grouped such that the dependencies between groups of different sets capture as much of the pairwise dependencies between the samples as possible. We formalize this problem in a novel probabilistic way, as optimization of a Bayes factor. The method is applied to reveal commonalities and exceptions in gene expression between organisms, and to suggest regulatory interactions in the form of dependencies between gene expression profiles and regulator binding patterns.
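To make the Bayes factor criterion concrete, the sketch below scores the contingency table obtained by cross-tabulating the cluster labels of paired samples. It compares a model in which the table cells are free (dependent margins) against a model in which the rows and columns are independent. The symmetric Dirichlet priors, the shared hyperparameter `alpha`, and the function names are illustrative assumptions, not details taken from the abstract. The sketch only evaluates the score for a fixed clustering; it does not perform the optimization over cluster assignments.

```python
from math import lgamma


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of counts under a multinomial model with a
    symmetric Dirichlet(alpha) prior. The multinomial coefficient is
    omitted because it cancels in the Bayes factor below."""
    k = len(counts)
    n = sum(counts)
    return (lgamma(k * alpha) - lgamma(n + k * alpha)
            + sum(lgamma(c + alpha) - lgamma(alpha) for c in counts))


def log_bayes_factor(table, alpha=1.0):
    """Log Bayes factor of 'clusters are dependent' versus 'clusters are
    independent' for a contingency table given as a list of rows.
    Assumption: the same hypothetical alpha is used for cells and margins."""
    cells = [c for row in table for c in row]
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return (log_dirichlet_multinomial(cells, alpha)
            - log_dirichlet_multinomial(row_sums, alpha)
            - log_dirichlet_multinomial(col_sums, alpha))


if __name__ == "__main__":
    # Rows: clusters in data set 1; columns: clusters in data set 2.
    dependent = [[50, 0], [0, 50]]      # cluster labels agree perfectly
    independent = [[25, 25], [25, 25]]  # cluster labels carry no shared signal
    print(log_bayes_factor(dependent))
    print(log_bayes_factor(independent))
```

Under these assumptions, a table whose mass concentrates in a few cells yields a positive log Bayes factor, while a table that factorizes into its margins yields a negative one. A clustering procedure of the kind described above would adjust the groupings in each data set to increase this score.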
