Calculating Edit Distance for Large Sets of String Pairs using MapReduce

Given two strings X and Y over a finite alphabet, the edit distance between X and Y, d(X, Y), is the minimum number of elementary edit operations required to edit X into Y. A dynamic programming algorithm elegantly computes this distance. In this paper, we investigate the parallelization of calculating edit distance for a large set of string pairs using MapReduce, a popular parallel computing framework. We propose the SIM MR and PRE MR algorithms, parallel versions of the dynamic programming solution, and present implementations of these algorithms. We study different cases by varying algorithm parameters, input size, and the number of parallel nodes, and we confirm both analytically and experimentally the superiority of our methods over the usual dynamic programming approach. This study demonstrates how MapReduce parallelization opens new avenues for designing dynamic programming algorithms.

Index Terms: Edit distance, Levenshtein distance, MapReduce, string manipulation, dynamic programming
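For orientation, the usual dynamic programming baseline referred to above can be sketched as follows. This is a minimal illustration and not the SIM MR or PRE MR algorithms, whose details are not given here. It assumes unit cost for insertion, deletion, and substitution (the Levenshtein distance). The names `edit_distance` and `map_pairs` are hypothetical, and `map_pairs` only mimics on a single machine the map-style step of applying the distance independently to each string pair.

```python
from typing import Iterable, List, Tuple


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance, assuming unit-cost insert, delete, and substitute."""
    # The distance is symmetric, so keep the shorter string in the inner loop
    # to hold only O(min(len(x), len(y))) cells in memory.
    if len(x) < len(y):
        x, y = y, x
    # prev[j] is the distance between the first i-1 characters of x
    # and the first j characters of y.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning i characters into the empty prefix takes i deletions
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def map_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Hypothetical map-style step: each pair is processed independently."""
    return [(x, y, edit_distance(x, y)) for x, y in pairs]
```

Because every pair is handled independently of the others, a step of this shape can in principle be distributed across nodes without coordination between pairs.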
