Locality-sensitive hashing for the edit distance

Abstract Motivation Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy. Results We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH. Availability and implementation The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019. Supplementary information Supplementary data are available at Bioinformatics online.

[1]  Piotr Indyk,et al.  Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false) , 2014, STOC.

[2]  Eugene W. Myers,et al.  A whole-genome assembly of Drosophila. , 2000, Science.

[3]  P. Diaconis,et al.  Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem , 1999 .

[4]  Andrew Zisserman,et al.  Near Duplicate Image Detection: min-Hash and tf-idf Weighting , 2008, BMVC.

[5]  E. Mauceli,et al.  Whole-genome sequence assembly for mammalian genomes: Arachne 2. , 2003, Genome research.

[6]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[7]  E. Popodi,et al.  Insertion sequence-caused large-scale rearrangements in the genome of Escherichia coli , 2016, Nucleic acids research.

[8]  Adam M. Phillippy,et al.  MUMmer4: A fast and versatile genome alignment system , 2018, PLoS Comput. Biol..

[9]  Tyler Moore,et al.  Polymorphic malware detection using sequence classification methods and ensembles , 2017, EURASIP J. Inf. Secur..

[10]  Chirag Jain,et al.  A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases , 2017, RECOMB.

[11]  Rafail Ostrovsky,et al.  Low distortion embeddings for edit distance , 2007, JACM.

[12]  Michael L. Fredman,et al.  On computing the length of longest increasing subsequences , 1975, Discret. Math..

[13]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[14]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[15]  Gabor T. Marth,et al.  SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications , 2012, PloS one.

[16]  Siu-Ming Yiu,et al.  SOAP3: ultra-fast GPU-based parallel alignment tool for short reads , 2012, Bioinform..

[17]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[18]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[19]  Thomas G. Szymanski,et al.  A fast algorithm for computing longest common subsequences , 1977, CACM.

[20]  Brian D. Ondov,et al.  Mash: fast genome and metagenome distance estimation using MinHash , 2015, Genome Biology.