A Decision-Theoretic Extension of Stochastic Complexity and Its Applications to Learning

Rissanen (1978) introduced stochastic complexity to define the amount of information in a given data sequence relative to a given hypothesis class of probability densities, where the information is measured in terms of the logarithmic loss associated with universal data compression. This paper introduces the notion of extended stochastic complexity (ESC) and demonstrates its effectiveness in the design and analysis of learning algorithms in on-line prediction and batch-learning scenarios. ESC can be thought of as an extension of Rissanen's stochastic complexity to the decision-theoretic setting, where a general real-valued function is used as a hypothesis and a general loss function is used as a distortion measure. As an application of ESC to on-line prediction, this paper shows that a sequential realization of ESC produces an on-line prediction algorithm called Vovk's aggregating strategy, which can be thought of as an extension of the Bayes algorithm. We derive upper bounds on the cumulative loss of the aggregating strategy, both in expected form and in worst-case form, for the case where the hypothesis class is continuous. As an application of ESC to batch learning, this paper shows that a batch approximation of ESC induces a batch-learning algorithm called the minimum L-complexity algorithm (MLC), an extension of the minimum description length (MDL) principle. We derive upper bounds on the statistical risk of the MLC that are the tightest known to date. Through ESC, we give a unifying view of the most effective learning algorithms that have been explored in computational learning theory.
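To make the on-line prediction side of the abstract concrete, the following is a minimal sketch of an aggregating-strategy-style algorithm in the simplest special case: a finite set of experts predicting a binary sequence under logarithmic loss, where the exponential weight update with learning rate eta = 1 coincides with the Bayes posterior. The function name `aggregating_strategy` and the parameter `eta` are illustrative choices, not notation from the paper, and this sketch does not cover the continuous hypothesis classes or general loss functions treated there.

```python
import math

def aggregating_strategy(expert_probs, outcomes, eta=1.0):
    """Aggregate a finite set of experts under logarithmic loss.

    expert_probs: list of per-expert prediction sequences;
        expert_probs[i][t] is expert i's predicted probability
        that outcome t equals 1.
    outcomes: list of 0/1 outcomes.
    Returns the cumulative log loss of the aggregated predictions.
    """
    n = len(expert_probs)
    weights = [1.0 / n] * n  # uniform prior over experts
    total_loss = 0.0
    for t, y in enumerate(outcomes):
        # Aggregated prediction: weighted mixture of expert predictions.
        p = sum(w * e[t] for w, e in zip(weights, expert_probs))
        total_loss += -math.log(p if y == 1 else 1.0 - p)
        # Exponential weight update; for eta = 1 and log loss this is
        # exactly the Bayes posterior update over experts.
        new_weights = []
        for w, e in zip(weights, expert_probs):
            loss = -math.log(e[t] if y == 1 else 1.0 - e[t])
            new_weights.append(w * math.exp(-eta * loss))
        z = sum(new_weights)
        weights = [w / z for w in new_weights]
    return total_loss
```

In this special case the classical mixture bound applies: the cumulative loss exceeds that of the best expert by at most (log n)/eta, which is the finite-class analogue of the regret bounds the paper derives for continuous hypothesis classes.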

[1] C. E. Shannon, et al. A mathematical theory of communication, 1948.

[2] N. G. de Bruijn. Asymptotic methods in analysis, 1958.

[3] J. Rissanen, et al. Modeling by shortest data description, 1978, Automatica.

[4] Jorma Rissanen, et al. A universal data compression system, 1983, IEEE Trans. Inf. Theory.

[5] Jorma Rissanen, et al. Universal coding, information, prediction, and estimation, 1984, IEEE Trans. Inf. Theory.

[6] J. Rissanen. Stochastic complexity and modeling, 1986.

[7] J. Berger. Statistical decision theory and Bayesian analysis, 1988.

[8] Manfred K. Warmuth, et al. The weighted majority algorithm, 1989, 30th Annual Symposium on Foundations of Computer Science.

[9] Andrew R. Barron, et al. Information-theoretic asymptotics of Bayes methods, 1990, IEEE Trans. Inf. Theory.

[10] Vladimir Vovk, et al. Aggregating strategies, 1990, COLT '90.

[11] Kenji Yamanishi, et al. A learning criterion for stochastic rules, 1990, COLT '90.

[12] Andrew R. Barron, et al. Complexity regularization with application to artificial neural networks, 1991.

[13] Andrew R. Barron, et al. Minimum complexity density estimation, 1991, IEEE Trans. Inf. Theory.

[14] David Haussler, et al. Decision theoretic generalizations of the PAC model for neural net and other learning applications, 1992, Inf. Comput.

[15] David Haussler, et al. How to use expert advice, 1993, STOC.

[16] Manfred K. Warmuth, et al. Using experts for predicting continuous outcomes, 1994, European Conference on Computational Learning Theory.

[17] David Haussler, et al. Tight worst-case loss bounds for predicting with expert advice, 1994, EuroCOLT.

[18] Kenji Yamanishi, et al. A loss bound model for on-line stochastic prediction algorithms, 1995, Inf. Comput.

[19] Yoav Freund, et al. Predicting a binary sequence almost as well as the optimal biased coin, 1996, COLT '96.

[20] Jorma Rissanen, et al. Fisher information and stochastic complexity, 1996, IEEE Trans. Inf. Theory.

[21] Manfred K. Warmuth, et al. How to use expert advice, 1997, J. ACM.

[22] Kenji Yamanishi, et al. On-line maximum likelihood prediction with respect to general loss functions, 1997, J. Comput. Syst. Sci.

[23] Vijay Balasubramanian, et al. Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions, 1996, Neural Computation.

[24] Kenji Yamanishi, et al. Minimax relative loss analysis for sequential prediction algorithms using parametric hypotheses, 1998, COLT '98.