Gene expression microarray data analysis demystified.

The increasing use of gene expression microarrays, and the deposition of the resulting data in public repositories, means that more investigators are interested in using the technology, either directly or through meta-analysis of publicly available data. The tools available for data analysis have generally been developed for experts in the field, which makes them difficult for the general research community to use. For those entering the field, especially those without a background in statistics, it is difficult to understand why experimental results can be so variable. The purpose of this review is to walk through the workflow of a typical microarray experiment and to show that the decisions made at each step, from choice of platform through statistical analysis methods to biological interpretation, are all sources of this variability.
