The shortest common superstring problem: Average case analysis for both exact and approximate matching

The shortest common superstring problem and its extension to approximate matching are considered in a probability model in which every string in a given set has the same length and the letters of the strings are drawn independently from a finite set. In the exact matching case, several algorithms proposed in the literature are shown to be asymptotically optimal in the following sense. Define the savings of a superstring as the difference between the total length of the strings in the given set and the length of the superstring. Then the ratio of the savings achieved by each of these algorithms to the optimal savings, achieved by the shortest superstring, converges in probability to 1 as the number of strings in the given set increases. In the approximate matching case, a modified version of the shortest common approximate matching superstring problem is analyzed; it is demonstrated that the optimal savings in this case is given approximately by $n \log n / I_l(Q, Q, 2D)$, where $n$ is the number of strings in the given set, $Q$ is the probability distribution governing the selection of letters of strings, $I_l(Q, Q, 2D)$ is the lower mutual information between $Q$ and $Q$ with respect to $2D$, and $D \ge 0$ is the distortion allowed in approximate matching. In addition, an approximation algorithm is proposed and proved asymptotically optimal.
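To make the notion of savings concrete, the following is a minimal sketch in Python. The abstract does not say which algorithms are analyzed, so the classical greedy merge heuristic (repeatedly merge the two strings with the largest suffix-prefix overlap) is used here purely as an assumed stand-in for the exact matching case. The function names `overlap`, `greedy_superstring`, and `savings` are illustrative and do not come from the paper.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Build a common superstring by greedy merging (a heuristic, not an exact solver)."""
    # Drop duplicates and any string already contained in another one.
    pool = list(dict.fromkeys(strings))
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]

    while len(pool) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the shared part only once.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return pool[0] if pool else ""


def savings(strings: list[str], superstring: str) -> int:
    """Total length of the input strings minus the superstring length."""
    return sum(len(s) for s in strings) - len(superstring)


# Hypothetical equal-length input, matching the model's equal-length assumption.
example = ["abcd", "cdef", "efgh"]
result = greedy_superstring(example)
```

For this input the two overlaps of length 2 are both exploited, so `result` is `"abcdefgh"` and `savings(example, result)` is 4. The asymptotic optimality statement concerns the ratio of such a savings value to the best achievable savings as the number of strings grows; this sketch only computes the numerator for one heuristic.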
