A framework and an algorithm to detect low‐abundance DNA by a handy sequencer and a palm‐sized computer

Motivation: Detecting DNA that is present at low abundance relative to the entire sample is an important problem in areas such as epidemiology and field research, where samples are highly contaminated with non‐target DNA. Many methods have been developed to address this problem, but all require additional time‐consuming and costly procedures. Meanwhile, the MinION sequencer developed by Oxford Nanopore Technologies (ONT) is considered a powerful tool for tackling this problem, as it allows selective sequencing of target DNA. The underlying technology, referred to as ‘Read Until’, rejects an undesirable read from a specific pore by inverting the voltage of that pore. Despite its usefulness, several issues remain to be solved in real situations. First, only limited computational resources are available in field research and epidemiological applications. Second, a high‐speed online classification algorithm is required to make a prompt decision. Lastly, the lack of a theoretical approach for modeling selective sequencing makes it difficult to analyze and justify a given algorithm.

Results: In this paper, we introduce a statistical model of selective sequencing, propose an efficient constant‐time classifier for any background DNA profile, and validate its optimal precision. To confirm the feasibility of the proposed method in practice, we demonstrate on a pre‐recorded mock sample that the method can selectively sequence a 100 kb region, constituting 0.1% of the entire read pool, and achieve approximately 500‐fold amplification. Furthermore, the algorithm is shown to process 26 queries per second on a $500 palm‐sized Next Unit of Computing (NUC) box with an Intel® Core™ i7 CPU, without additional computing resources such as a GPU or high‐performance computing. Next, we prepared a mixed DNA pool composed of Saccharomyces cerevisiae and lambda phage, in which any 200 kb region of S. cerevisiae constitutes 0.1% of the whole sample. From this sample, the 30‐230 kb region of S. cerevisiae chromosome 1 was amplified approximately 30‐fold. In addition, the method allows the amplified region to be changed on the fly according to the characteristics uncovered in a given DNA sample.

Availability and implementation: The source code is available at: https://bitbucket.org/ban-m/dyss.
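
To give a feel for how fold amplification relates to classifier accuracy, the following is a minimal back‐of‐the‐envelope sketch. It is not the statistical model introduced in the paper. It assumes a simplified setting in which every read has the same length, the keep or reject decision is made after a fixed prefix, and a rejected read costs only that prefix. The function name `fold_enrichment` and all of its parameters are hypothetical and chosen for illustration only.

```python
def fold_enrichment(p, read_len, prefix_len, sensitivity, specificity):
    """Expected fold enrichment of target bases under a toy model.

    Assumptions (illustrative, not taken from the paper):
      p            fraction of reads that come from the target region
      read_len     length in bases of every read if sequenced in full
      prefix_len   bases sequenced before the keep/reject decision
      sensitivity  probability that a target read is kept
      specificity  probability that a non-target read is rejected
    """
    # Expected target bases per read: kept reads are sequenced in full,
    # wrongly rejected target reads contribute only their prefix.
    target = p * (sensitivity * read_len + (1 - sensitivity) * prefix_len)
    # Expected background bases per read: wrongly kept reads are sequenced
    # in full, correctly rejected reads contribute only their prefix.
    background = (1 - p) * (
        (1 - specificity) * read_len + specificity * prefix_len
    )
    # Enrichment is the target share of sequenced bases divided by the
    # share expected without any selection, which is p.
    return (target / (target + background)) / p


if __name__ == "__main__":
    # Hypothetical numbers: 0.1% target reads, 8 kb reads, 500-base prefix.
    print(fold_enrichment(0.001, 8000, 500, 0.9, 0.99))
```

Under this toy model, a classifier that rejects nothing gives an enrichment of 1, and a perfect classifier with a negligible decision prefix approaches the upper bound of 1/p, that is, 1000‐fold for a target making up 0.1% of the reads.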
