Hierarchical universal coding

In an earlier paper, we proved a strong version of the redundancy-capacity converse theorem of universal coding, stating that for "most" sources in a given class, the universal coding redundancy is essentially lower-bounded by the capacity of the channel induced by this class. Since this result holds for general classes of sources, it extends Rissanen's (1986) strong converse theorem for parametric families. While our earlier result established strong optimality only for mixture codes weighted by the capacity-achieving prior, our first result herein extends this finding to a general prior. In some cases, our technique also leads to a simplified proof of the above-mentioned strong converse theorem. The main interest of this paper, however, is in extending the theory of universal coding to hierarchical structures of classes, where each class may have a different capacity. In this setting, one wishes to incur redundancy essentially as small as that corresponding to the active class, rather than to the union of classes. Our main result is that the redundancy of a code based on a two-stage mixture (first within each class, and then over the classes) is no worse than that of any other code for "most" sources of "most" classes. If, in addition, the classes can be efficiently distinguished by a certain decision rule, then the best attainable redundancy is given explicitly by the capacity of the active class plus the normalized negative logarithm of the prior probability assigned to this class. These results suggest some interesting guidelines for the choice of the prior. We also discuss some examples with a natural hierarchical partition into classes.
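To make the two-stage mixture concrete, here is a minimal Python sketch (illustrative, not from the paper) for a toy hierarchy of binary sources with two classes: memoryless and first-order Markov. It mixes within each class using a Krichevsky-Trofimov (KT) estimator and then mixes over the classes with a prior; the function names, the KT choice, and the uniform class prior are all assumptions made for this example.

```python
import math

def kt_iid(x):
    """KT mixture probability of a binary string under the
    memoryless (i.i.d. Bernoulli) class, computed sequentially."""
    p, n0, n1 = 1.0, 0, 0
    for b in x:
        p *= (n1 + 0.5) / (n0 + n1 + 1.0) if b else (n0 + 0.5) / (n0 + n1 + 1.0)
        n0, n1 = n0 + (b == 0), n1 + (b == 1)
    return p

def kt_markov1(x):
    """KT mixture probability under the first-order binary Markov
    class: one KT estimator per context (the previous symbol)."""
    p, ctx = 1.0, 0                      # arbitrary fixed initial context
    counts = {0: [0, 0], 1: [0, 0]}      # context -> [count of 0s, count of 1s]
    for b in x:
        n0, n1 = counts[ctx]
        p *= (n1 + 0.5) / (n0 + n1 + 1.0) if b else (n0 + 0.5) / (n0 + n1 + 1.0)
        counts[ctx][b] += 1
        ctx = b
    return p

def two_stage_code_length(x, prior=(0.5, 0.5)):
    """Ideal code length (bits) of the two-stage mixture: mix with KT
    within each class, then mix over the two classes with `prior`."""
    q = prior[0] * kt_iid(x) + prior[1] * kt_markov1(x)
    return -math.log2(q)

x = [1, 0] * 16                          # strongly first-order Markov data
print(two_stage_code_length(x))          # close to the Markov-class length
print(-math.log2(kt_iid(x)))             # memoryless class alone does worse
```

In line with the abstract's main result, a source in class j incurs per-symbol redundancy of roughly C_j + (1/n) log(1/pi_j), where C_j is the normalized capacity of class j and pi_j its prior weight; with the uniform prior above, the two-stage code pays at most one extra bit in total over coding within the true class alone, since -log q(x) <= -log P_j(x) + log(1/pi_j).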

[1] Andrew R. Barron, et al. Minimum complexity density estimation, 1991, IEEE Trans. Inf. Theory.

[2] Jorma Rissanen, et al. Universal coding, information, prediction, and estimation, 1984, IEEE Trans. Inf. Theory.

[3] Jorma Rissanen, et al. Complexity of strings in the class of Markov sources, 1986, IEEE Trans. Inf. Theory.

[4] Neri Merhav, et al. Optimal sequential probability assignment for individual sequences, 1994, IEEE Trans. Inf. Theory.

[5] R. Gallager, Information Theory and Reliable Communication, 1968.

[6] Lee D. Davisson, et al. Minimax noiseless universal coding for Markov sources, 1983, IEEE Trans. Inf. Theory.

[7] B. G. Quinn, et al. The determination of the order of an autoregression, 1979.

[8] Alberto Leon-Garcia, et al. A source matching approach to finding minimax codes, 1980, IEEE Trans. Inf. Theory.

[9] Lee D. Davisson, et al. Universal noiseless coding, 1973, IEEE Trans. Inf. Theory.

[10] Thomas M. Cover, et al. On the competitive optimality of Huffman codes, 1991, IEEE Trans. Inf. Theory.

[11] Jorma Rissanen, et al. A universal data compression system, 1983, IEEE Trans. Inf. Theory.

[12] Frans M. J. Willems, et al. The context-tree weighting method: basic properties, 1995, IEEE Trans. Inf. Theory.

[13] A. Barron, et al. Jeffreys' prior is asymptotically least favorable under entropy risk, 1994.

[14] Andrew R. Barron, et al. Information-theoretic asymptotics of Bayes methods, 1990, IEEE Trans. Inf. Theory.

[15] J. Rissanen, Stochastic Complexity and Modeling, 1986.

[16] M. Feder, et al. Predictive stochastic complexity and model estimation for finite-state processes, 1994.

[17] A. Barron, et al. Approximation of density functions by sequences of exponential families, 1991.

[18] Neri Merhav, et al. A strong version of the redundancy-capacity theorem of universal coding, 1995, IEEE Trans. Inf. Theory.

[19] Steven Rudich, et al. Inferring the Structure of a Markov Chain from its Output, 1985, FOCS.