Kipoi: accelerating the community exchange and reuse of predictive models for genomics

Advanced machine learning models applied to large-scale genomics datasets promise to be major drivers of genome science. Once trained, such models can serve as tools to probe the relationships between data modalities, including the effect of genetic variants on phenotype. However, a lack of standardization and the limited accessibility of trained models have hampered their impact in practice. To address this, we present Kipoi, a collaborative initiative to define standards for trained models in genomics and to foster their reuse. The Kipoi repository already contains over 2,000 trained models that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard enables automated software installation and provides unified interfaces to apply and interpret models. We illustrate Kipoi through canonical use cases, including model benchmarking, transfer learning, variant effect prediction, and building new models from existing ones. By providing a unified framework to archive, share, access, use, and build on models developed by the community, Kipoi will foster the dissemination and use of machine learning models in genomics.
