A fully scalable online pre-processing algorithm for short oligonucleotide microarray atlases

Rapid accumulation of large and standardized microarray data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current pre-processing techniques has, however, formed a bottleneck for full utilization of these data resources. Although short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level techniques have been available for only a few platforms, and they rely on probe effects pre-calculated from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm for probe-level analysis and pre-processing of large microarray atlases involving tens of thousands of arrays. In contrast to the alternatives, our algorithm scales linearly with sample size and is applicable to all short oligonucleotide platforms. The model can use the most comprehensive data collections available to date to pinpoint individual probes affected by noise and biases, providing tools to guide array design and quality control. This is the only available algorithm that can learn probe-level parameters through sequential hyperparameter updates on small consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray collections.
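
The idea of sequential hyperparameter updates on small batches can be illustrated with a minimal sketch. It is not the algorithm of the paper: it assumes, purely for illustration, that each probe's residuals are zero-mean Gaussian with an inverse-gamma prior on the probe-specific variance, so that the posterior from one batch serves as the prior for the next. The function names, prior values, batch size, and simulated data are all hypothetical.

```python
import random


def update_hyperparameters(alpha, beta, residuals):
    """Conjugate inverse-gamma update of per-probe variance hyperparameters.

    residuals is a list of arrays (rows), each holding one residual per probe.
    Assumption for illustration: residuals are zero-mean Gaussian per probe.
    """
    n_arrays = len(residuals)
    new_alpha = [a + n_arrays / 2.0 for a in alpha]
    new_beta = list(beta)
    for row in residuals:
        for j, r in enumerate(row):
            new_beta[j] += 0.5 * r * r
    return new_alpha, new_beta


def posterior_mean_variance(alpha, beta):
    # Mean of an inverse-gamma(alpha, beta) distribution, valid for alpha > 1.
    return [b / (a - 1.0) for a, b in zip(alpha, beta)]


# Simulated residuals for three hypothetical probes with increasing noise.
random.seed(0)
true_sd = [0.1, 0.5, 2.0]
data = [[random.gauss(0.0, s) for s in true_sd] for _ in range(3000)]

# Online pass: only one batch of arrays is held in memory at a time.
prior_alpha, prior_beta = [2.0] * 3, [1.0] * 3
alpha, beta = prior_alpha, prior_beta
batch_size = 100
for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]
    alpha, beta = update_hyperparameters(alpha, beta, batch)

# Reference: a single update using all arrays at once.
alpha_full, beta_full = update_hyperparameters(prior_alpha, prior_beta, data)

online_var = posterior_mean_variance(alpha, beta)
print("estimated probe variances:", online_var)
```

Under these assumptions the batch-wise pass arrives at the same hyperparameters as a single pass over all arrays, while memory use is bounded by the batch size and the cost grows linearly with the number of arrays. Probes with a large estimated variance would be the candidates flagged as noisy.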
