The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences

Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our ability to generate biological sequence data and our ability to fully exploit it. Comparable analytical challenges arise in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. By leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded accuracy comparable to that of existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed our methodology using both synthetic and real viral pathogen sequences. Our results show that highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data.
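
To make the idea of comparing sequences in a compressed numerical space concrete, the following is a minimal sketch and not the method used in this work. It assumes a simple integer encoding of nucleotides (the `ENCODING` table is hypothetical) and uses piecewise aggregate approximation (PAA), a basic reduction technique from the time-series literature, as a stand-in for the transformations explored here. All function names are illustrative.

```python
import math

# Hypothetical nucleotide-to-number mapping, chosen only for illustration.
ENCODING = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def encode(seq):
    """Turn a nucleotide string into a list of numbers."""
    return [ENCODING[base] for base in seq.upper()]


def paa(signal, segments):
    """Piecewise aggregate approximation: the mean of each equal-width window."""
    if len(signal) % segments != 0:
        raise ValueError("signal length must be a multiple of segments")
    width = len(signal) // segments
    return [sum(signal[i:i + width]) / width
            for i in range(0, len(signal), width)]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduced_distance(a, b, segments):
    """Distance computed on the PAA representations, scaled by sqrt(window width).

    With this scaling the result never exceeds the full-resolution
    Euclidean distance, so it can be used to discard poor candidates cheaply.
    """
    width = len(a) // segments
    return math.sqrt(width) * euclidean(paa(a, segments), paa(b, segments))


# Toy reads of equal length (assumed data, not taken from the study).
read = encode("ACGTACGTTTGACCAA")
near = encode("ACGTACGTTTGACCAT")
far = encode("TTTTGGGGCCCCAAAA")

for name, other in (("near", near), ("far", far)):
    full = euclidean(read, other)
    approx = reduced_distance(read, other, segments=4)
    print(f"{name}: full={full:.3f} reduced={approx:.3f}")
```

In this sketch each 16-base read is compared using only four values. Because the reduced distance is a lower bound on the full distance, a candidate whose reduced distance already exceeds a threshold can be rejected without a full-resolution comparison.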
