
Locality-sensitive hashing for the edit distance

Abstract

Motivation: Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have a high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy in practice.

Results: We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.

Availability and implementation: The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019.

Supplementary information: Supplementary data are available at Bioinformatics online.
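To make the idea in the Results paragraph concrete, here is a minimal Python sketch of an order-aware minHash. It is an illustration under stated assumptions, not the authors' implementation (see the repository linked above for that). The function names, the parameters `k`, `ell`, `seed`, and `m`, and the use of BLAKE2b as a stand-in for a random permutation are all hypothetical choices made for this example. For each seed, the sketch keeps the `ell` k-mers with the smallest hash values and then lists them in the order in which they occur in the sequence, so two sequences with the same k-mer content but a different k-mer order can produce different sketches.

```python
import hashlib


def _hash(seed, kmer, occurrence):
    """Pseudo-random 64-bit hash of a (k-mer, occurrence number) pair.

    Assumption: BLAKE2b stands in for the random permutation an LSH needs.
    """
    data = f"{seed}:{kmer}:{occurrence}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def kmers_with_occurrence(seq, k):
    """Return (position, k-mer, occurrence number) for every k-mer in seq.

    The occurrence number distinguishes repeated copies of the same k-mer.
    """
    seen = {}
    items = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        occ = seen.get(kmer, 0)
        seen[kmer] = occ + 1
        items.append((pos, kmer, occ))
    return items


def omh_sketch(seq, k, ell, seed):
    """Order-aware sketch: the ell lowest-hash k-mers, in sequence order."""
    items = kmers_with_occurrence(seq, k)
    if len(items) < ell:
        raise ValueError("sequence has fewer than ell k-mers")
    # Select the ell k-mers with the smallest hash values (the minHash step).
    chosen = sorted(items, key=lambda t: _hash(seed, t[1], t[2]))[:ell]
    # Re-sort the selection by position so the sketch records relative order.
    chosen.sort(key=lambda t: t[0])
    return tuple(kmer for _, kmer, _ in chosen)


def omh_similarity(a, b, k, ell, m):
    """Fraction of m independent sketches on which a and b collide."""
    hits = sum(omh_sketch(a, k, ell, s) == omh_sketch(b, k, ell, s)
               for s in range(m))
    return hits / m
```

With `ell = 1` the sketch reduces to a single lowest-hash k-mer, which is the ordinary minHash setting; larger `ell` makes a collision depend on the relative order of the selected k-mers as well as on their content. A call such as `omh_similarity(read_a, read_b, k=5, ell=2, m=100)` (parameter values chosen arbitrarily here) gives a collision rate that can be used to filter candidate pairs before running a full alignment.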

Prashant Pandey | Carl Kingsford | Guillaume Marçais | Dan F. DeBlasio
