Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

We present worst-case bounds on the learning rate of a known prediction method based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic instantiation of this approach that relies on Huffman's alphabet decomposition is known to achieve state-of-the-art performance on prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of the hierarchical method and offers a compelling explanation for its practical success. In addition, we report the results of several experiments that examine other possibilities for improving the multi-alphabet prediction performance of CTW-based algorithms.
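To make the hierarchical construction concrete, here is a minimal sketch of sequential multi-alphabet prediction via Huffman alphabet decomposition. It is not the paper's algorithm: each internal node of the Huffman tree should hold a binary CTW predictor, but for brevity this sketch substitutes a memoryless Krichevsky-Trofimov (KT) estimator per node, and it builds the tree from symbol frequencies assumed to be given up front. All names and the example frequencies are illustrative only.

```python
import heapq
from itertools import count

class KTEstimator:
    """Krichevsky-Trofimov binary estimator: a memoryless stand-in for
    the binary CTW predictor placed at each internal tree node."""
    def __init__(self):
        self.counts = [0, 0]  # counts of observed 0s and 1s

    def prob(self, bit):
        # KT add-one-half rule: P(bit) = (n_bit + 1/2) / (n_0 + n_1 + 1)
        return (self.counts[bit] + 0.5) / (sum(self.counts) + 1.0)

    def update(self, bit):
        self.counts[bit] += 1

def huffman_codes(freqs):
    """Build Huffman codewords (tuples of bits) from a {symbol: freq} dict."""
    tiebreak = count()  # unique secondary key so dicts are never compared
    heap = [(f, next(tiebreak), {s: ()}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {s: (0,) + c for s, c in left.items()}
        merged.update({s: (1,) + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]

class DecomposedPredictor:
    """Multi-alphabet predictor via Huffman alphabet decomposition:
    one binary estimator per internal node (i.e., per codeword prefix)."""
    def __init__(self, freqs):
        self.codes = huffman_codes(freqs)
        self.nodes = {}  # codeword prefix -> binary estimator at that node

    def prob(self, symbol):
        # Chain rule: product of the binary decision probabilities
        # along the symbol's root-to-leaf path.
        p, prefix = 1.0, ()
        for bit in self.codes[symbol]:
            node = self.nodes.setdefault(prefix, KTEstimator())
            p *= node.prob(bit)
            prefix += (bit,)
        return p

    def update(self, symbol):
        prefix = ()
        for bit in self.codes[symbol]:
            self.nodes.setdefault(prefix, KTEstimator()).update(bit)
            prefix += (bit,)

# Usage: predict, then update, one symbol at a time.
pred = DecomposedPredictor({'a': 45, 'b': 13, 'c': 12, 'd': 16})
for sym in "abacada":
    print(sym, pred.prob(sym))
    pred.update(sym)
```

Because the two branch probabilities at every internal node sum to one, the induced distribution over the full alphabet is automatically normalized. Replacing each KTEstimator with a binary CTW predictor conditioned on the preceding symbol context recovers the kind of hierarchical scheme whose worst-case performance the paper bounds.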
