Novel algorithms for efficient subsequence searching and mapping in nanopore raw signals towards targeted sequencing

ABSTRACT Genome diagnostics has gradually become routine in human healthcare. With advances in understanding the causal genes of many human diseases, targeted sequencing provides a rapid, cost-efficient and focused option for clinical applications, such as SNP detection and haplotype classification, in a specific genomic region. Nanopore sequencing is well suited to targeted sequencing because of its portability, PCR-free operation, and long reads, but it poses a challenging computational problem: how to efficiently and accurately search for and map genomic subsequences of interest in a pool of nanopore reads (or raw signals). Because the sequencing accuracy of nanopore reads is relatively low, there is no reliable solution to this problem, especially at low sequencing coverage. Here, we propose a new signal-based subsequence inquiry pipeline, together with two novel algorithms, to tackle this problem. The proposed algorithms follow the principle of subsequence dynamic time warping and operate directly on the electrical current signals, avoiding the information loss incurred by base-calling. The proposed algorithms can therefore serve as a tool for sequence inquiry in targeted sequencing. Two novel criteria are offered for the subsequent signal quality analysis and data classification. Comprehensive experiments on real-world nanopore datasets show the efficiency and effectiveness of the proposed algorithms. We further demonstrate the potential applications of the proposed algorithms in two typical tasks in nanopore-based targeted sequencing: SNP detection under low sequencing coverage, and haplotype classification under low sequencing accuracy.
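
To make the principle of subsequence dynamic time warping concrete, the following is a minimal sketch of the classic textbook formulation, not the two algorithms proposed in the paper. It assumes both the query and the long raw signal are plain lists of current values, uses absolute difference as the local cost, and fills the full quadratic table; the function name `subsequence_dtw` and the toy values are hypothetical and chosen only for illustration.

```python
import math


def subsequence_dtw(query, signal):
    """Find where `query` best matches inside `signal` (classic subsequence DTW).

    Returns (cost, start, end) with 0-based, inclusive indices into `signal`.
    Illustrative O(len(query) * len(signal)) version; the local cost
    (absolute difference) is an assumption, not taken from the paper.
    """
    m, n = len(query), len(signal)
    if m == 0 or n == 0:
        raise ValueError("query and signal must be non-empty")

    inf = math.inf
    # D[i][j]: minimum cost of aligning query[:i] to a stretch of the signal
    # that ends at signal[j - 1]. Row 0 is all zeros, so a match may start
    # anywhere in the signal at no cost.
    D = [[0.0] * (n + 1)] + [[inf] * (n + 1) for _ in range(m)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(query[i - 1] - signal[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Free end: the best match may stop at any signal position.
    end = min(range(1, n + 1), key=lambda j: D[m][j])

    # Backtrack to the first query row to recover the start of the match.
    i, j = m, end
    while i > 1:
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (D[i - 1][j], i - 1, j),          # query advances, signal stays
            (D[i][j - 1], i, j - 1),          # signal advances, query stays
        )
    return D[m][end], j - 1, end - 1


if __name__ == "__main__":
    # Toy values standing in for an expected query signal and a raw read.
    query = [1.0, 5.0, 2.0]
    signal = [0.0, 0.0, 1.0, 5.0, 5.0, 2.0, 9.0]
    print(subsequence_dtw(query, signal))
```

The zero-initialised first row and the minimum over the last row are what distinguish subsequence matching from end-to-end DTW: the query may begin and end anywhere in the longer signal, while warping inside the match absorbs the repeated value in the toy signal.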
