Gene expression inference with deep learning

Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiling has been dropping steadily, generating a compendium of expression profiles over thousands of samples is still very expensive. Recognizing that the expression levels of genes are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ~1,000 carefully selected landmark genes and relying on computational methods to infer the expression of the remaining target genes. However, the computational approach currently adopted by the LINCS program is based on linear regression, which limits its accuracy because it does not capture the complex nonlinear relationships between the expression levels of genes.

Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based GEO dataset, consisting of 111K expression profiles, to train our model and compare its performance to that of other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms linear regression, with a 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than linear regression in 99.97% of the target genes. We also tested the performance of the learned model on an independent RNA-Seq-based GTEx dataset consisting of 2,921 expression profiles. Deep learning still outperforms linear regression, with a 6.57% relative improvement, and achieves lower error in 81.31% of the target genes.

Availability: D-GEX is available at https://github.com/uci-cbcl/D-GEX.

Contact: xhx@ics.uci.edu
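
To make the setup concrete, below is a minimal sketch of landmark-to-target expression inference framed as multi-output regression: a fully connected neural network is compared against per-gene linear regression using gene-wise mean absolute error. This is not the authors' D-GEX implementation; it uses scikit-learn and synthetic data standing in for GEO profiles, with toy dimensions in place of the ~1,000 landmark genes and the much larger set of target genes.

```python
# Minimal sketch (assumptions: synthetic data, scikit-learn models) of the
# landmark-to-target inference task described in the abstract.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_samples, n_landmark, n_target = 2000, 50, 200  # toy sizes, not the real gene counts

# Synthetic profiles: target genes depend nonlinearly on landmark genes plus noise.
X = rng.normal(size=(n_samples, n_landmark))
W = rng.normal(size=(n_landmark, n_target))
Y = np.tanh(X @ W) + 0.1 * rng.normal(size=(n_samples, n_target))

X_tr, X_te = X[:1600], X[1600:]
Y_tr, Y_te = Y[:1600], Y[1600:]

# Baseline: linear regression from landmark genes to all target genes.
lin = LinearRegression().fit(X_tr, Y_tr)
mae_lin = mean_absolute_error(Y_te, lin.predict(X_te), multioutput="raw_values")

# Deep model stand-in: one multi-layer network predicting all targets jointly.
mlp = MLPRegressor(hidden_layer_sizes=(500, 500), max_iter=200,
                   random_state=0).fit(X_tr, Y_tr)
mae_mlp = mean_absolute_error(Y_te, mlp.predict(X_te), multioutput="raw_values")

# Gene-wise comparison, mirroring the abstract's evaluation protocol.
print(f"mean MAE  linear: {mae_lin.mean():.4f}  network: {mae_mlp.mean():.4f}")
print(f"target genes where the network achieves lower error: {np.mean(mae_mlp < mae_lin):.1%}")
```

The joint multi-output formulation reflects the idea in the abstract that correlated gene expression can be exploited by a single nonlinear model, whereas the linear baseline fits each target gene as a linear combination of the landmark genes.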
