Genome Functional Annotation using Deep Convolutional Neural Networks

Deep neural networks are now applied in almost every scientific discipline. In genomics, which deals with DNA sequences, they are expected to change current practice, from fundamental questions such as understanding genome evolution to more specific applications such as personalized medicine. Several approaches rely on convolutional neural networks (CNNs) to identify the functional role of sequences such as promoters, enhancers, or protein binding sites along genomes. These approaches generate batches of sequences with known annotations for training. While they perform well when predicting annotations on a held-out test subset of these batches, they usually perform less well when applied genome-wide (i.e., for whole-genome annotation). In this paper, we address this issue and propose an optimal strategy for training CNNs for this specific application. As a case study, we use gene transcription start sites (TSS) and show that a model trained on one organism (e.g., human) can be used to predict TSS in a different species (e.g., mouse).

Author summary

We propose a method that uses deep convolutional neural networks to label genomes with functional annotations. Functional annotations cover any relevant feature that can be associated with specific positions on the genome (e.g., promoters, enhancers, conserved regions). The method is based on an optimized generation of the examples used to train the network, in order to deal with the well-known problem of unbalanced data. When these annotations are known in one species, the trained network can be used to predict them in a different species, provided the mechanisms used to interpret the genome are conserved between the two species. As a case study, we use gene transcription start sites (TSS) in human and show that the model trained on human TSS can recover similar information on the mouse genome.
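
To make the idea of example generation under class imbalance concrete, the following is a minimal sketch and not the authors' implementation. It cuts fixed-length windows around annotated TSS positions as positive examples and draws a configurable number of background windows per positive as negatives. The function names, the window half-width `flank`, the ratio `neg_per_pos`, and the one-hot layout are all assumptions made for illustration; the summary above does not specify them.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 indicators (A, C, G, T).

    Any other symbol (e.g. N) becomes an all-zero row. This layout is an
    assumption for illustration, not taken from the paper.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def sample_windows(genome, tss_positions, flank=5, neg_per_pos=10, seed=0):
    """Return shuffled (window_sequence, label) pairs from a toy genome.

    Positives (label 1) are windows of length 2 * flank + 1 centered on a TSS.
    Negatives (label 0) are windows centered on randomly chosen non-TSS
    positions, drawn without replacement at up to neg_per_pos per positive.
    Both flank and neg_per_pos are hypothetical parameters.
    """
    rng = random.Random(seed)
    tss = set(tss_positions)
    # Centers for which a full window fits inside the sequence.
    centers = range(flank, len(genome) - flank)

    def window(p):
        return genome[p - flank:p + flank + 1]

    positives = [(window(p), 1) for p in sorted(tss) if p in centers]
    # Listing every background position is fine for a toy sequence only.
    background = [p for p in centers if p not in tss]
    n_neg = min(len(background), neg_per_pos * len(positives))
    negatives = [(window(p), 0) for p in rng.sample(background, n_neg)]

    examples = positives + negatives
    rng.shuffle(examples)
    return examples


if __name__ == "__main__":
    toy_genome = "ACGT" * 50          # 200-base placeholder sequence
    toy_tss = [20, 100, 150]          # made-up TSS coordinates
    data = sample_windows(toy_genome, toy_tss, flank=5, neg_per_pos=10)
    encoded = [(one_hot(seq), label) for seq, label in data]
    print(len(encoded), "examples,", sum(label for _, label in data), "positive")
```

In this sketch, `neg_per_pos` is the single knob that controls how unbalanced the training set is. A balanced set corresponds to `neg_per_pos=1`, while larger values move the training distribution closer to the proportions a model would face when scanning a whole genome.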
