conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads

Single Molecule Real-Time (SMRT) sequencing is a recent advance in next-generation sequencing technology developed by Pacific Biosciences (PacBio). It produces a flood of long, noisy reads, and getting the most out of them demands new methods. To cope with the high error rate of SMRT data, this article proposes a novel contextual Locality Sensitive Hashing (conLSH) based algorithm that can effectively align noisy SMRT reads to a reference genome. Here, sequences are hashed together based not only on their closeness but also on the similarity of their context. The algorithm has an O(n^(ρ+1)) space requirement, where n is the number of sequences in the corpus and ρ is a constant. The indexing time and querying time are bounded by O(n^(ρ+1) · ln n / ln(1/P2)) and O(n^ρ) respectively, where P2 > 0 is a probability value. The algorithm is particularly useful for retrieving similar sequences, a widely used task in biology. The proposed conLSH-based aligner is compared with rHAT, a popular aligner for SMRT reads, and outperforms it in both speed and memory requirements. In particular, it takes approximately 24.2% less processing time while saving about 70.3% in peak memory for the H. sapiens PacBio dataset.
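
The idea of hashing sequences by closeness and by context can be illustrated with a small sketch. This is not the conLSH implementation: the class name `ContextLSH`, the helper `index_reference`, and the parameters `K` (sampled positions per table), `L` (number of tables) and `lam` (context radius) are assumptions chosen for illustration. Each hash key concatenates the symbols around a few randomly sampled positions, so two windows share a bucket only when they agree on those positions together with their neighbourhoods.

```python
import random
from collections import defaultdict


class ContextLSH:
    """Toy context-based LSH index over fixed-length sequence windows (illustrative only)."""

    def __init__(self, window, K=2, L=4, lam=2, seed=0):
        rng = random.Random(seed)
        self.window = window
        self.lam = lam
        # Each table samples K centre positions that have lam symbols on both sides.
        self.positions = [
            sorted(rng.sample(range(lam, window - lam), K)) for _ in range(L)
        ]
        self.tables = [defaultdict(set) for _ in range(L)]

    def _key(self, seq, positions):
        # The key is the sampled symbols together with their surrounding context.
        lam = self.lam
        return "|".join(seq[p - lam : p + lam + 1] for p in positions)

    def add(self, seq, label):
        if len(seq) != self.window:
            raise ValueError("sequence length must equal the window size")
        for positions, table in zip(self.positions, self.tables):
            table[self._key(seq, positions)].add(label)

    def query(self, seq):
        # Union of the labels found in the matching bucket of every table.
        hits = set()
        for positions, table in zip(self.positions, self.tables):
            hits |= table.get(self._key(seq, positions), set())
        return hits


def index_reference(reference, window, **kwargs):
    """Index every window of the reference, labelled by its start offset."""
    index = ContextLSH(window, **kwargs)
    for start in range(len(reference) - window + 1):
        index.add(reference[start : start + window], start)
    return index


reference = "ACGTTGCAAGGCTTACGATCGGATCCTAGGCATT"
index = index_reference(reference, window=16)
candidates = index.query(reference[5:21])  # candidate offsets for this window
```

In this sketch, a larger `K` or `lam` makes buckets more selective, while a larger `L` gives a noisy read more chances to collide with its true location. Candidates returned by `query` would still need to be verified by an actual alignment step.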
