Fast and Simple Computations Using Prefix Tables Under Hamming and Edit Distance

In this article, we introduce a new and simple data structure, the prefix table under Hamming distance, and present two algorithms to compute it efficiently: one asymptotically fast; the other very fast on average and in practice. Because the latter approach avoids the computation of global data structures, such as the suffix array and the longest common prefix array, it yields algorithms much faster in practice than existing methods. We show how this data structure can be used to solve two string problems of interest: (a) approximate string matching under Hamming distance; and (b) longest approximate overlap under Hamming distance. Analogously, we introduce the prefix table under edit distance, and present an efficient algorithm for its computation. In the process, we also define the border array under both distance measures, and provide an algorithm for conversion between prefix tables and border arrays.
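To make the central object concrete, the sketch below computes a prefix table under Hamming distance by brute force. It is not either of the two algorithms presented in the article. It is a naive quadratic baseline under an assumed definition: for a string x and a threshold k, entry i holds the length of the longest prefix of x[i..] that matches a prefix of x with at most k mismatches. The function names, and the trick of concatenating pattern and text for approximate matching, are illustrative choices rather than the authors' formulation.

```python
def hamming_prefix_table(x: str, k: int) -> list[int]:
    """Naive O(n^2) prefix table under Hamming distance (assumed definition).

    table[i] is the length of the longest prefix of x[i:] that matches
    a prefix of x with at most k mismatches.
    """
    n = len(x)
    table = [0] * n
    for i in range(n):
        mismatches = 0
        length = 0
        # Extend the match one character at a time until the text ends
        # or a (k+1)-th mismatch would be needed.
        while i + length < n:
            if x[length] != x[i + length]:
                if mismatches == k:
                    break
                mismatches += 1
            length += 1
        table[i] = length
    return table


def approximate_matches(pattern: str, text: str, k: int) -> list[int]:
    """Start positions in text where pattern occurs with at most k mismatches.

    Hypothetical helper: builds the table of pattern + text and reports
    every text position whose entry reaches the pattern length.
    """
    m = len(pattern)
    table = hamming_prefix_table(pattern + text, k)
    return [i for i in range(len(text) - m + 1) if table[m + i] >= m]
```

With k = 0 the table reduces to the ordinary exact prefix table. For example, `hamming_prefix_table("abaab", 1)` allows one mismatch per position, and `approximate_matches("abc", "xabdabc", 1)` reports the two windows of the text that differ from the pattern in at most one character.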
