An attention-based neural network basecaller for Oxford Nanopore sequencing data

The highly portable Oxford Nanopore sequencer, which produces long reads in real time at low cost, has enabled many breakthroughs in genomics studies. However, a major limitation of nanopore sequencing is its high error rate when deciphering DNA sequences from noisy and complex raw data. Here we develop SACall, an end-to-end basecaller based on convolution layers, transformer self-attention layers and a connectionist temporal classification (CTC) decoder. In terms of read accuracy, SACall performs better in the benchmark than the official ONT basecallers Guppy and Albacore. SACall is open-source and freely available, allowing researchers to train new basecalling models on specific data and to basecall Nanopore reads.
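
To illustrate the final stage of such a pipeline, the sketch below shows greedy (best-path) CTC decoding: take the highest-scoring label at each time step, merge consecutive repeats, then drop blanks. This is a minimal illustration under stated assumptions, not SACall's actual decoder, which may use a different strategy such as beam search. The function name `ctc_greedy_decode`, the label layout (index 0 as blank, 1 to 4 as A, C, G, T) and the toy scores are all hypothetical.

```python
from itertools import groupby

# Assumed label layout (hypothetical): 0 is the CTC blank, 1-4 are the bases.
BLANK = 0
ALPHABET = {1: "A", 2: "C", 3: "G", 4: "T"}


def ctc_greedy_decode(scores):
    """Decode per-time-step label scores into a base sequence.

    scores: list of lists, one inner list of 5 scores per time step.
    """
    # 1. Best path: index of the highest score at each time step.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in scores]
    # 2. Merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks and map the remaining labels to bases.
    return "".join(ALPHABET[label] for label in collapsed if label != BLANK)


def one_hot(label, size=5):
    """Toy score vector that puts all weight on one label."""
    return [1.0 if i == label else 0.0 for i in range(size)]


# Toy best path: A A blank A C C blank T
toy_scores = [one_hot(label) for label in [1, 1, 0, 1, 2, 2, 0, 4]]
print(ctc_greedy_decode(toy_scores))  # AACT
```

The blank between the two runs of A is what allows a repeated base (a homopolymer) to survive the merge step; without it, the three A frames would collapse into a single A.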