Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously

Motivation: Large compendia of gene expression data have proven valuable for discovering novel biological relationships. Most available RNA assays were run on microarrays, while RNA-seq is becoming the platform of choice for new experiments. The two platforms produce data with different structures and distributions, which makes them challenging to combine. We performed supervised and unsupervised machine learning evaluations, as well as differential expression analyses, to assess which normalization methods are best suited for combining microarray and RNA-seq data.

Results: We find that quantile and Training Distribution Matching normalization allow for supervised and unsupervised model training on microarray and RNA-seq data simultaneously. Nonparanormal normalization and z-scores are also appropriate for some applications, including differential expression analysis.

Availability and Implementation: These analyses were performed in R and are available at https://www.github.com/greenelab/RNAseq_titration_results under a BSD 3-Clause license.

Contact: csgreene@upenn.edu

Supplementary information is available.
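To make the idea of quantile normalization across platforms concrete, the following is a minimal Python sketch, not the authors' R implementation. It assumes log-scale genes-by-samples matrices covering the same genes on both platforms, takes the mean of the sorted microarray columns as the target distribution, and runs on simulated data. The function names `reference_quantiles` and `quantile_normalize_to_target` are hypothetical.

```python
import numpy as np


def reference_quantiles(array_expr):
    """Target distribution: mean of the sorted values across samples.

    array_expr is a genes x samples matrix (assumed microarray, log scale).
    """
    return np.sort(array_expr, axis=0).mean(axis=1)


def quantile_normalize_to_target(expr, target):
    """Map each sample (column) of expr onto the target distribution by rank.

    Ties are broken arbitrarily by the sort order; a production
    implementation would usually average tied ranks instead.
    """
    # argsort of argsort gives the 0-based rank of every value in its column
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)
    return np.sort(target)[ranks]


# Simulated data for illustration only (not from the paper)
rng = np.random.default_rng(0)
microarray = rng.normal(loc=8.0, scale=2.0, size=(100, 5))
rnaseq = np.log2(rng.negative_binomial(5, 0.05, size=(100, 4)) + 1.0)

target = reference_quantiles(microarray)
normalized = quantile_normalize_to_target(rnaseq, target)
```

After this step every RNA-seq sample has exactly the same set of values as the microarray-derived target, while the within-sample ordering of genes is kept, so both platforms can be passed to one model on a shared scale.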
