Cross-platform normalization of microarray and RNA-seq data for machine learning applications

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
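
As a rough illustration of two of the baseline transformations named above, the following sketch applies a log2 transformation to RNA-seq counts and then quantile normalizes each sample against a microarray reference distribution. It is not the TDM algorithm and not the API of the TDM R package. The function names, the pseudocount of 1, and the choice to pool all reference values into one target distribution are assumptions made for this example only.

```python
import numpy as np


def log2_transform(counts, pseudocount=1.0):
    """Return log2(counts + pseudocount); the pseudocount is an assumption."""
    return np.log2(np.asarray(counts, dtype=float) + pseudocount)


def quantile_normalize_to_reference(target, reference):
    """Map each sample (column) of `target` onto the reference distribution.

    target:    genes x samples matrix (e.g. log-transformed RNA-seq values)
    reference: matrix of values from the training platform (e.g. microarray);
               all values are pooled into a single empirical distribution.
    """
    target = np.asarray(target, dtype=float)
    pooled = np.asarray(reference, dtype=float).ravel()
    n_genes = target.shape[0]

    # Reference quantiles at n_genes evenly spaced probabilities.
    probs = (np.arange(n_genes) + 0.5) / n_genes
    ref_quantiles = np.quantile(pooled, probs)

    out = np.empty_like(target)
    for j in range(target.shape[1]):
        # Rank of each gene within the sample (ties broken by position).
        order = np.argsort(target[:, j], kind="stable")
        ranks = np.argsort(order, kind="stable")
        # Replace each value with the reference quantile of the same rank.
        out[:, j] = ref_quantiles[ranks]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    microarray = rng.normal(loc=8.0, scale=2.0, size=(100, 5))  # synthetic
    rnaseq_counts = rng.poisson(lam=20.0, size=(100, 3))        # synthetic
    normalized = quantile_normalize_to_reference(
        log2_transform(rnaseq_counts), microarray
    )
    print(normalized.shape)
```

After this step every RNA-seq sample shares the same set of values, drawn from the reference distribution, while the within-sample ordering of genes is preserved. The data in the usage block are synthetic and serve only to show the call pattern.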
