Biological Sequence Modeling with Convolutional Kernel Networks

The growing number of annotated biological sequences makes it possible to learn genotype-phenotype relationships from data with increasing accuracy. When large quantities of labeled samples are available for training, convolutional neural networks can predict the phenotype of unannotated sequences with good accuracy. Unfortunately, their performance degrades on medium- or small-scale datasets, which calls for new data-efficient approaches. In this paper, we introduce a hybrid approach between convolutional neural networks and kernel methods to model biological sequences. Our method retains the ability of convolutional neural networks to learn data representations adapted to a specific task, while the kernel point of view yields algorithms that perform significantly better when the amount of training data is small. We illustrate these advantages on transcription factor binding prediction and protein homology detection, and we demonstrate that our model is also simple to interpret, which is crucial for discovering predictive motifs in sequences. The source code is freely available at https://gitlab.inria.fr/dchen/CKN-seq.
