A multi-task convolutional deep neural network for variant calling in single molecule sequencing

The accurate identification of DNA sequence variants is an important but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5–15%. To meet this challenge, we developed Clairvoyante, a multi-task five-layer convolutional neural network model that predicts variant type (SNP or indel), zygosity, alternative allele, and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieves F1-scores of 99.67%, 95.78%, and 90.53% on 1KP common variants, and 98.65%, 92.57%, and 87.26% for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows that Clairvoyante is sample agnostic, and it finds variants in less than 2 h on a standard server. Furthermore, we present 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open-source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize, and visualize the model.

Single Molecule Sequencing (SMS) technologies generate long but noisy reads. Here, the authors develop Clairvoyante, a deep neural network-based method for variant calling with SMS reads such as PacBio and ONT data.
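For readers unfamiliar with the metric, the F1-scores quoted in the abstract are the harmonic mean of precision and recall. The sketch below shows that standard calculation from true-positive, false-positive, and false-negative counts. The function name and the example counts are hypothetical and are not taken from the paper or from the Clairvoyante code base.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall).

    tp, fp, fn are counts of true-positive, false-positive, and
    false-negative variant calls. Returns 0.0 when there are no
    true positives, since precision and recall are then both zero
    or undefined.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # fraction of called variants that are correct
    recall = tp / (tp + fn)     # fraction of true variants that were called
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
example = f1_score(tp=90, fp=10, fn=10)
print(f"F1 = {example:.2%}")
```

Because F1 penalizes both missed variants and spurious calls, a single percentage per sequencing technology summarizes overall calling accuracy.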
