Performance comparisons of various runs algorithms

This thesis describes and discusses empirical comparisons of the execution times of three programs for computing runs in strings. Since two of the programs (crochB and crochB7) were thought to implement O(n log n) algorithms and the third (runFinder) implements a linear algorithm, runFinder was expected to strongly outperform the other two on long strings. The aim of this study is thus threefold. First, we establish the upper limits on string length for which crochB and crochB7 are faster than, or comparable to, runFinder. Second, we investigate what performance penalty crochB7 incurs for its memory-saving implementation. Third, we explore the relative trade-offs of the techniques represented by the tested programs: in what contexts is it advantageous to use one of the investigated programs rather than another? The motivation for this work is to continue the work of Franek, Jiang, Smyth, Weng, and Xiao, who implemented a space-efficient version of Crochemore's repetition algorithm [6] and then extended it to compute runs [4, 5]. The three programs tested are: 1. crochB, a direct C++ implementation of the extension of Crochemore's algorithm for runs by Franek, Jiang, and Weng, without any space-saving techniques; 2. crochB7, a space-efficient version of crochB by the same authors; 3. runFinder, an efficient C++ implementation by Hideo Bannai of the Department of Informatics at Kyushu University in Japan.
Bannai's implementation follows a linear-time strategy: it computes the suffix array of the string; from the suffix array it computes the LCP array; from the suffix and LCP arrays it computes the Lempel-Ziv factorization; from the Lempel-Ziv factorization it computes all leftmost runs using Main's algorithm; and it computes the remaining runs using Kolpakov and Kucherov's algorithm. In this thesis, the three programs are discussed, the experimental setup for the performance measurements is described, the measurements are presented, and a brief analysis of the results follows. It will be shown that, although O(n log n) performance can be expected from the latest version of the Crochemore implementation on one category of the investigated data, in some circumstances (discussed in the thesis) its performance is of order n, and in others it falls between order n and order n log n.
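
To make the object being computed concrete, the following is a minimal brute-force sketch of what all three programs output: the runs (maximal repetitions) of a string, each reported as a start position, an end position, and a smallest period. It is not the method of crochB, crochB7, or runFinder; it is a roughly quadratic reference procedure of the kind that might be used to cross-check results on short inputs. The names `Run` and `naiveRuns` are hypothetical and chosen only for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one run: x[start..end] (0-based, inclusive)
// has smallest period `period`, has length >= 2 * period, and cannot be
// extended to the left or right without breaking that period.
struct Run {
    std::size_t start;
    std::size_t end;
    std::size_t period;
};

// Brute-force reference, NOT one of the algorithms compared in the thesis.
// For each candidate period p it scans for maximal stretches of positions k
// with x[k] == x[k + p]; a stretch of at least p such positions yields a
// substring of length >= 2p with period p.
std::vector<Run> naiveRuns(const std::string& x) {
    const std::size_t n = x.size();
    std::vector<Run> runs;
    // Intervals already reported with a smaller period. Periods are tried in
    // increasing order, so the first report of an interval carries its
    // smallest period; later reports of the same interval are skipped.
    std::set<std::pair<std::size_t, std::size_t>> seen;

    for (std::size_t p = 1; 2 * p <= n; ++p) {
        std::size_t k = 0;
        while (k + p < n) {
            if (x[k] != x[k + p]) {
                ++k;
                continue;
            }
            const std::size_t first = k;  // first matching position
            while (k + p < n && x[k] == x[k + p]) {
                ++k;
            }
            const std::size_t matches = k - first;
            if (matches >= p) {
                const std::size_t last = k - 1 + p;  // end of the repetition
                if (seen.insert({first, last}).second) {
                    runs.push_back({first, last, p});
                }
            }
        }
    }
    return runs;
}
```

For example, on the string "aabaab" this sketch reports three runs: "aa" at positions 0..1 with period 1, "aa" at positions 3..4 with period 1, and the whole string at positions 0..5 with period 3.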

[1] Michael G. Main, et al. Detecting leftmost maximal periodicities, 1989, Discret. Appl. Math.

[2] Frantisek Franek, et al. An Improved Version of the Runs Algorithm Based on Crochemore's Partitioning Algorithm, 2011, Stringology.

[3] Chia-Chun Weng. Implementing Efficient Algorithms for Computing Runs, 2011.

[4] Dong Kyue Kim, et al. Linear-Time Construction of Suffix Arrays, 2003, CPM.

[5] Gregory Kucherov, et al. Finding maximal repetitions in a word in linear time, 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[6] Peter Sanders, et al. Simple Linear Work Suffix Array Construction, 2003, ICALP.

[7] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[8] William F. Smyth, et al. Computing Patterns in Strings, 2003.

[9] Michael W. Marcellin, et al. Data Compression Conference (DCC 2010), 2008.

[10] Frantisek Franek, et al. Crochemore's Repetitions Algorithm Revisited: Computing Runs, 2012, Int. J. Found. Comput. Sci.

[11] Maxime Crochemore, et al. An Optimal Algorithm for Computing the Repetitions in a Word, 1981, Inf. Process. Lett.

[12] Frantisek Franek, et al. A Note on Crochemore's Repetitions Algorithm - A Fast Space-Efficient Approach, 2003, Nord. J. Comput.

[13] Srinivas Aluru, et al. Space efficient linear time construction of suffix arrays, 2003, J. Discrete Algorithms.

[14] Lucian Ilie, et al. A Simple Algorithm for Computing the Lempel Ziv Factorization, 2008, Data Compression Conference (dcc 2008).