Models and information-theoretic bounds for nanopore sequencing

Nanopore sequencing is an emerging technology for sequencing DNA that can read long fragments (∼50,000 bases), unlike most current sequencers, which can read only hundreds of bases. Although nanopore sequencers can acquire long reads, their high error rates (≈ 30%) pose a technical challenge. In a nanopore sequencer, a DNA molecule is migrated through a nanopore and the resulting current variations are measured. The DNA sequence is then inferred from the observed current pattern by an algorithm called a base-caller. In this paper, we propose a mathematical model for the “channel” from the input DNA sequence to the observed current, and calculate bounds on the information extraction capacity of the nanopore sequencer. The model incorporates impairments such as inter-symbol interference, deletions, and random response. The practical application of such information bounds is two-fold: (1) benchmarking present base-calling algorithms, and (2) offering an optimization objective for designing better nanopore sequencers.
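To make the three impairments concrete, the following is a minimal toy simulation, not the model proposed in the paper. It assumes that inter-symbol interference can be pictured as the current depending on a window of k consecutive bases, that deletions drop whole current samples independently, and that the random response is additive Gaussian noise. The names `kmer_level` and `toy_nanopore_channel`, the parameters `k`, `p_del`, and `noise_sd`, and the k-mer-to-current mapping are all hypothetical choices made for illustration.

```python
import random

BASES = "ACGT"


def kmer_level(kmer):
    """Hypothetical mean current for a k-mer: its base-4 index scaled to [0, 10]."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + BASES.index(base)
    return 10.0 * idx / (4 ** len(kmer) - 1)


def toy_nanopore_channel(seq, k=4, p_del=0.1, noise_sd=0.5, seed=0):
    """Map a DNA string to a list of noisy current samples (illustrative only).

    - Inter-symbol interference: each sample depends on k adjacent bases.
    - Deletions: each sample is dropped independently with probability p_del.
    - Random response: Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    samples = []
    for i in range(len(seq) - k + 1):
        if rng.random() < p_del:
            continue  # this k-mer's sample is never observed
        samples.append(rng.gauss(kmer_level(seq[i:i + k]), noise_sd))
    return samples


if __name__ == "__main__":
    observed = toy_nanopore_channel("ACGTACGTAC", k=4, p_del=0.2, noise_sd=0.5)
    print(len(observed), "samples observed out of 7 possible")
```

A base-caller sees only the output list, which is shorter than the input whenever deletions occur and carries no markers for where they happened. That loss of synchronization is one reason capacity bounds for such channels are harder to obtain than for memoryless channels.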
