The smallest grammar problem

This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields such as data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, the worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results concern the hardness of approximating the smallest grammar problem. Most notably, we show that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P=NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^(1/2)). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter algorithm highlights a connection between grammar-based compression and LZ77.
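
To make the object of study concrete, the following is a minimal Python sketch of a grammar that generates exactly one string, built by repeatedly replacing the most frequent adjacent pair of symbols with a new nonterminal, in the spirit of RE-PAIR. It is an illustration under stated assumptions, not the paper's formulation of any of the algorithms named above: the function names (`expand`, `grammar_size`, `count_pairs`, `pair_replacement_grammar`), the use of integers for nonterminals, the tie-breaking among equally frequent pairs, and the choice to measure size as the total number of right-hand-side symbols are all assumptions made here.

```python
from collections import Counter

def expand(grammar, symbol):
    """Return the terminal string derived from `symbol`.
    Assumption: terminals are str, nonterminals are int keys of `grammar`."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])

def grammar_size(grammar):
    """Assumed size measure: total number of symbols on all right-hand sides."""
    return sum(len(rhs) for rhs in grammar.values())

def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair, left to right."""
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_start.get(pair) == i - 1:
            continue  # overlaps the occurrence just counted (e.g. "aaa")
        counts[pair] += 1
        last_start[pair] = i
    return counts

def pair_replacement_grammar(text):
    """Greedy pair replacement (RE-PAIR-like sketch).
    Returns (grammar, start_symbol)."""
    seq = list(text)
    grammar = {}
    next_id = 0
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so a new rule would not help
        grammar[next_id] = list(pair)
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    grammar[next_id] = seq  # start rule
    return grammar, next_id

grammar, start = pair_replacement_grammar("abababab")
print(grammar, grammar_size(grammar), expand(grammar, start))
```

Under this size measure each replacement step never increases the total size, since a pair occurring k ≥ 2 times removes k symbols from the sequence and adds a rule of length 2. The sketch gives no guarantee of finding the smallest grammar, which is exactly the gap the approximation ratios in the abstract quantify.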
