Computational analysis of microarray gene expression profiles: clustering, classification, and beyond

Abstract: Gene array studies can assess the global expression patterns of thousands of genes under multiple conditions. This technology can provide important insights into the genetic basis of many biological questions and can change our understanding of diseases, ultimately supporting the development of novel chemical entities as potential drug candidates. The computational analysis and integration of gene expression patterns are critical for interpreting gene array studies. In this paper, we discuss the computational analysis of three important tasks: (1) the identification of differentially expressed genes, (2) the discovery of gene clusters, and (3) the classification of biological samples. In addition, we discuss how gene sequences and chemical structures can be profitably combined with microarray studies. Detailed examples are given throughout. Programs written in the open-source R language for each of these tasks are freely available at gila.engr.uic.edu/genex.
