Functional bioinformatics of microarray data: from expression to regulation

Microarrays are a powerful technique for monitoring the expression of thousands of genes in a single experiment. From a series of such experiments, it is possible to identify the mechanisms that govern the activation of genes in an organism. Short deoxyribonucleic acid (DNA) patterns near the genes, called binding sites, serve as switches that control gene expression. As a result, similar patterns of expression can correspond to similar binding site patterns. Here we integrate the clustering of coexpressed genes with the discovery of binding motifs. We review several important clustering techniques and present a clustering algorithm, called adaptive quality-based clustering, which we developed to address several shortcomings of existing methods. We then review the different techniques for motif finding, in particular Gibbs sampling, and present several extensions of this technique in our Motif Sampler. Finally, we present an integrated web tool called INCLUSive (available online at http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html) that allows easy analysis of microarray data for motif finding.
