Integrated data analysis for genome-wide research.

Integrated data analysis is introduced as the intermediate level of a systems biology approach. It analyses different 'omics' datasets, i.e., genome-wide measurements of transcripts, protein levels or protein-protein interactions, and metabolite levels, with the aim of generating a coherent understanding of biological function. In this chapter we focus on methods of correlation analysis that were recently applied in molecular biology, ranging from simple pairwise correlation to kernel canonical correlation. Several examples illustrate their application. Because the input data frequently originate from different experimental platforms, preprocessing steps such as data normalisation and missing value estimation are inherent to this approach. We review the corresponding procedures, potential pitfalls and biases, and available software solutions. The multiplicity of observations obtained in omics-profiling experiments also necessitates the application of multiple testing correction techniques.
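
As a rough illustration of the simplest end of this range, the sketch below computes a pairwise Pearson correlation between two profiles and adjusts a list of p-values for multiple testing. It is a minimal sketch under stated assumptions, not the chapter's own procedure: the function names `pearson` and `bh_adjust` are hypothetical, the Benjamini-Hochberg step-up adjustment is chosen here only as one common correction technique, and the numbers are made up for demonstration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (e.g. a transcript
    and a metabolite measured across the same samples)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.
    Assumption: BH is used here only as one example of a correction."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative (made-up) data: one transcript and one metabolite profile.
transcript = [1.0, 2.0, 3.0, 4.0]
metabolite = [2.0, 4.0, 6.0, 8.0]
r = pearson(transcript, metabolite)

# Illustrative (made-up) p-values from four pairwise tests.
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.20])
```

In a real omics setting the number of pairwise tests grows with the product of the feature counts of the two datasets, which is why the adjustment step matters far more there than in this toy example.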
