A faster and more accurate heuristic for cyclic edit distance computation

This letter describes a new heuristic algorithm to compute the cyclic edit distance. We extend an existing algorithm which compares circular sequences using q-grams. Theoretical insight to support the suitability of the algorithm is provided. Experiments show the heuristic is both faster and more accurate than existing heuristics.

Sequence comparison is the core computation of many applications involving textual representations of data. Edit distance is the most widely used measure to quantify the similarity of two sequences. It can be defined as the minimal total cost of a sequence of edit operations that transforms one sequence into the other; for a sequence x of length m and a sequence y of length n, it can be computed in time O(mn). In many applications, it is common to consider sequences with circular structure: for instance, the orientation of two images or the leftmost position of two linearised circular DNA sequences may be irrelevant. To this end, an algorithm to compute the cyclic edit distance in time O(mn log m) was proposed (Maes, 1990 [13]), and several heuristics have been proposed to speed up this computation. Recently, a new algorithm based on q-grams was proposed for circular sequence comparison (Grossi et al., 2016 [16]). We extend this algorithm for cyclic edit distance computation and show that this new heuristic is faster and more accurate than the state of the art. The aim of this letter is to give visibility to this idea in the pattern recognition community.
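To make the quantities mentioned in the abstract concrete, the following Python sketch shows the standard O(mn) dynamic programme for unit-cost edit distance, a naive cyclic edit distance that tries every rotation of x, and a q-gram profile distance computed on circular strings. This is not the heuristic proposed in the letter, nor the O(mn log m) algorithm of Maes; it is only an illustrative baseline. The function names, the choice of unit costs, and the wrap-around handling of q-grams (which assumes q is at most the length of each string) are assumptions made for this example.

```python
from collections import Counter


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance via the standard O(mn) dynamic programme."""
    n = len(y)
    # prev[j] = distance between the current prefix of x and y[:j]
    prev = list(range(n + 1))
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete x[i-1]
                          curr[j - 1] + 1,      # insert y[j-1]
                          prev[j - 1] + cost)   # match or substitute
        prev = curr
    return prev[n]


def cyclic_edit_distance_naive(x: str, y: str) -> int:
    """Minimum edit distance over all rotations of x (O(m^2 n) baseline)."""
    if not x:
        return len(y)
    return min(edit_distance(x[i:] + x[:i], y) for i in range(len(x)))


def circular_qgram_distance(x: str, y: str, q: int) -> int:
    """Sum of absolute differences between circular q-gram counts.

    Assumes 1 <= q <= min(len(x), len(y)).
    """
    def profile(s: str) -> Counter:
        ext = s + s[:q - 1]  # wrap around so every rotation has the same profile
        return Counter(ext[i:i + q] for i in range(len(s)))

    px, py = profile(x), profile(y)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))            # 3
    print(cyclic_edit_distance_naive("abcde", "cdeab"))  # 0
    print(circular_qgram_distance("abcde", "cdeab", 2))  # 0
```

The naive rotation loop is what the heuristics discussed here aim to avoid: a cheap q-gram comparison can suggest a promising rotation before the more expensive dynamic programme is run.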

[1] Knut Reinert, et al. SeqAn An efficient, generic C++ library for sequence analysis, 2008, BMC Bioinformatics.

[2] Sergio Barrachina, et al. Speeding up the computation of the edit distance for cyclic strings, 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[3] Eugene W. Myers, et al. A fast bit-vector algorithm for approximate string matching based on dynamic programming, 1998, JACM.

[4] M. Crochemore, et al. Algorithms on Strings: Tools, 2007.

[5] Costas S. Iliopoulos, et al. Average-Case Optimal Approximate Circular String Matching, 2014, LATA.

[6] Horst Bunke, et al. Applications of approximate string matching to 2D shape recognition, 1993, Pattern Recognit.

[7] Costas S. Iliopoulos, et al. Fast algorithms for approximate circular string matching, 2014, Algorithms for Molecular Biology.

[8] Costas S. Iliopoulos, et al. Accurate and Efficient Methods to Improve Multiple Circular Sequence Alignment, 2015, SEA.

[9] J. Todd. Book Review: Digital image processing (second edition). By R. C. Gonzalez and P. Wintz, Addison-Wesley, 1987. 503 pp. Price: £29.95. (ISBN 0-201-11026-1), 1988.

[10] Andrés Marzal, et al. Speeding up the cyclic edit distance using LAESA with early abandon, 2015, Pattern Recognit. Lett.

[11] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[12] Herbert Freeman, et al. On the Encoding of Arbitrary Geometric Configurations, 1961, IRE Trans. Electron. Comput.

[13] M. Maes, et al. On a Cyclic String-To-String Correction Problem, 1990, Inf. Process. Lett.

[14] Andrés Marzal, et al. Dynamic Time Warping of Cyclic Strings for Shape Matching, 2005, ICAPR.

[15] Our Correspondent in Molecular Biology. Circular DNA, 1967, Nature.

[16] Roberto Grossi, et al. Circular sequence comparison: algorithms and applications, 2016, Algorithms for Molecular Biology.

[17] S. B. Needleman, et al. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins, 1989.

[18] Thomas Sikora, et al. The MPEG-7 visual standard for content description-an overview, 2001, IEEE Trans. Circuits Syst. Video Technol.

[19] Esko Ukkonen, et al. Two Algorithms for Approximate String Matching in Static Texts, 1991, MFCS.

[20] Yann LeCun, et al. The mnist database of handwritten digits, 2005.

[21] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[22] Costas S. Iliopoulos, et al. Fast circular dictionary-matching algorithm, 2015, Mathematical Structures in Computer Science.

[23] Andrés Marzal, et al. On the dynamic time warping of cyclic sequences for shape retrieval, 2012, Image Vis. Comput.

[24] Esko Ukkonen, et al. Approximate String Matching with q-grams and Maximal Matches, 1992, Theor. Comput. Sci.

[25] Maxime Crochemore, et al. Algorithms on strings, 2007.

[26] Ziheng Yang, et al. INDELible: A Flexible Simulator of Biological Sequence Evolution, 2009, Molecular biology and evolution.

[27] David J Craik, et al. Thematic Minireview Series on Circular Proteins, 2012, The Journal of Biological Chemistry.

[28] Francisco Casacuberta, et al. Cyclic Sequence Alignments: Approximate Versus Optimal Techniques, 2002, Int. J. Pattern Recognit. Artif. Intell.

[29] Enrique Vidal, et al. Computation of Normalized Edit Distance and Applications, 1993, IEEE Trans. Pattern Anal. Mach. Intell.

[30] Francisco Casacuberta, et al. Efficient Techniques for a Very Accurate Measurement of Dissimilarities between Cyclic Patterns, 2000, SSPR/SPR.