Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.
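The 'mutation map' idea can be illustrated with a minimal sketch: score a sequence, substitute each position with every alternative base, and record how the score changes. The code below is not DeepBind. The scorer `toy_score`, the matrix `TOY_PWM` and its values, and the use of a plain score difference are all assumptions made for illustration; a trained model would take the place of the scorer.

```python
from typing import Callable, Dict, List

BASES = "ACGT"

# Hypothetical toy position weight matrix, one dict per motif position.
# The values are invented for illustration and favor the motif "ACG".
TOY_PWM: List[Dict[str, float]] = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]


def toy_score(seq: str) -> float:
    """Stand-in scorer (assumption): best PWM match over all windows."""
    width = len(TOY_PWM)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(TOY_PWM[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    )


def mutation_map(seq: str, score: Callable[[str], float]) -> Dict[str, List[float]]:
    """Return, for each base, the score change from substituting it at each position.

    The entry for the reference base at a position is 0.0 by construction.
    """
    reference_score = score(seq)
    result: Dict[str, List[float]] = {}
    for base in BASES:
        row: List[float] = []
        for i, ref in enumerate(seq):
            if ref == base:
                row.append(0.0)
            else:
                mutated = seq[:i] + base + seq[i + 1:]
                row.append(score(mutated) - reference_score)
        result[base] = row
    return result


if __name__ == "__main__":
    changes = mutation_map("TACGT", toy_score)
    for base in BASES:
        print(base, changes[base])
```

Each row of the result corresponds to one substituted base and each column to one sequence position, so the table can be drawn directly as a heat map. Negative entries mark substitutions that weaken the predicted binding under the stand-in scorer.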
