Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments

Progress in mapping the genome and developments in array technologies have provided large amounts of information for delineating the roles of genes involved in complex diseases and quantitative traits. Because complex phenotypes are determined by a network of interrelated biological traits, typically involving multiple inter-correlated genetic and environmental factors that interact in a hierarchical fashion, microarrays hold tremendous latent information. The analysis of microarray data is, however, still a bottleneck. In this paper, we review recent advances in statistical analyses for associating phenotypes with the molecular events underpinning microarray experiments. Classical statistical procedures for analyzing phenotypes in genetics are reviewed first, followed by descriptions of the statistical procedures for linking molecular events to measured gene expression phenotypes (microarray-based gene expression) and to observed phenotypes such as disease status. These statistical procedures include (1) preliminary analyses, such as data quality control and normalization, for minimizing the effects of experimental artifacts and random noise; (2) gene selection and differentiation procedures based on inferential statistics for class comparisons; (3) analysis of dynamic temporal patterns through exploratory statistics, such as unsupervised clustering, and through supervised classification and prediction; and (4) assessment of the reliability of microarray studies using real-time PCR, together with the reproducibility issues that arise across many studies and multiple platforms. In addition, post-analysis that links the discovered patterns of gene expression to pathway and functional analyses of the selected genes is considered, in order to increase our understanding of interconnected gene processes.
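As a rough illustration of step (2), the sketch below compares two classes gene by gene with a permutation test on the difference in mean expression and then adjusts the resulting p-values for multiple testing with the Benjamini-Hochberg step-up procedure. This is only one possible choice of test and correction, not necessarily the approach the review recommends; the function names, the number of permutations, and the toy expression values are all hypothetical.

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    group_a, group_b: expression values of ONE gene in the two classes.
    n_perm and seed are illustrative defaults, not recommendations.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the samples
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy data: each gene maps to (class A values, class B values).
    genes = {
        "gene_1": ([10, 11, 12, 13], [1, 2, 3, 4]),
        "gene_2": ([5, 6, 5, 6], [5, 6, 5, 6]),
    }
    names = list(genes)
    raw = [permutation_pvalue(*genes[g]) for g in names]
    for name, p, q in zip(names, raw, benjamini_hochberg(raw)):
        print(name, round(p, 4), round(q, 4))
```

With only a handful of arrays per class, the number of distinct relabellings is small, so the smallest attainable permutation p-value is bounded well away from zero; in that setting a moderated or model-based statistic may be preferable.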
