On vocabulary size of grammar-based codes

We discuss inequalities that hold between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression of a string, and the excess length of the respective universal code, i.e., the code-based analog of algorithmic mutual information. The aim is to strengthen inequalities that were discussed in a weaker form in linguistics but that shed some light on the redundancy of efficiently computable codes. The main contribution of the paper is a construction of universal grammar-based codes for which the excess lengths can be bounded easily.
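The two quantities compared above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's construction: the toy grammar `GRAMMAR`, the helper names, the choice to count the start symbol in the vocabulary, the reading of excess length as len(C(u)) + len(C(v)) - len(C(uv)) by analogy with algorithmic mutual information, and the use of zlib as a stand-in code (zlib is not a grammar-based code).

```python
import zlib

# Hypothetical toy grammar: each nonterminal maps to its right-hand side.
# Symbols that are not keys of the dict are terminals.
GRAMMAR = {
    "S": ["A", "A", "B"],
    "A": ["B", "B"],
    "B": ["a", "b"],
}


def expand(grammar, symbol="S"):
    """Recursively expand a symbol into the string it derives."""
    if symbol not in grammar:
        return symbol  # terminal symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])


def vocabulary_size(grammar):
    """Number of distinct nonterminals (assumption: start symbol included)."""
    return len(grammar)


def excess_length(code_length, u, v):
    """Assumed code-based analog of mutual information between u and v."""
    return code_length(u) + code_length(v) - code_length(u + v)


def zlib_length(text):
    """Code length in bits under zlib, used only as a stand-in code."""
    return 8 * len(zlib.compress(text.encode("utf-8")))


if __name__ == "__main__":
    text = expand(GRAMMAR)
    print(text, vocabulary_size(GRAMMAR))
    print(excess_length(zlib_length, text * 20, text * 20))
```

Passing the code length as a function keeps the sketch independent of any particular code, so a real grammar-based code length could be substituted for `zlib_length`.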
