DataRemix: a universal data transformation for optimal inference from gene expression datasets

RNAseq technology provides unprecedented power for assessing transcript abundance and can be used to perform a variety of downstream tasks, such as inference of gene-correlation networks and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study. We describe a generalization of SVD-based reconstruction for which the common techniques of whitening, rank-k approximation, and removing the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweight the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via a Thompson Sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the ROSMAP dataset and report what is, to our knowledge, the first replicable trans-eQTL effect in human brain.
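
The abstract does not state the transformation itself, so the following is only a minimal sketch of one possible three-parameter SVD reweighting, under stated assumptions: the function name `remix` and the parameters `k` (number of leading components), `p` (exponent applied to their singular values), and `mu` (weight on the residual) are illustrative choices, not taken from the text above. It shows how rank-k approximation and whitening can fall out of a single formula as special cases.

```python
import numpy as np

def remix(X, k, p, mu):
    """Hypothetical three-parameter SVD reweighting (illustrative sketch only).

    Assumed form: the top-k singular values are raised to the power p,
    and the residual left after removing the top-k components is scaled by mu.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    # Top-k reconstruction with rescaled singular values s**p.
    top_scaled = (U[:, :k] * s[:k] ** p) @ Vt[:k]
    # Ordinary top-k reconstruction, used to form the residual.
    top_plain = (U[:, :k] * s[:k]) @ Vt[:k]
    return top_scaled + mu * (X - top_plain)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))          # toy matrix standing in for expression data

rank2 = remix(X, k=2, p=1, mu=0)     # rank-2 approximation
same = remix(X, k=2, p=1, mu=1)      # recovers X unchanged
white = remix(X, k=5, p=0, mu=0)     # all singular values set to 1 (whitening)
```

In this sketch, intermediate values of `p` and `mu` interpolate between these extremes, which is the sense in which such a transformation can reweight hidden factors rather than simply keep or discard them.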
