Predictability, Complexity, and Learning

We define predictive information Ipred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: Ipred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then Ipred(T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of Ipred(T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.

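Written out explicitly, the definition and the three large-T behaviors summarized above take the following form. This is a sketch in standard information-theoretic notation, where $x_{\rm past}$ is a window of past observations of duration $T$, $x_{\rm future}$ is the future of the series, $S(\cdot)$ denotes an entropy, and $K$, $A$, and $\alpha$ are symbols introduced here for illustration (the $K/2$ coefficient is the familiar dimension-counting factor of minimum-description-length arguments, not a quantity quoted from the abstract):

\[
I_{\rm pred}(T)
  \;=\;
  \left\langle \log_2
    \frac{P(x_{\rm future},\,x_{\rm past})}{P(x_{\rm future})\,P(x_{\rm past})}
  \right\rangle
  \;=\;
  S(x_{\rm past}) + S(x_{\rm future}) - S(x_{\rm past},\,x_{\rm future}),
\]
\[
I_{\rm pred}(T) \;\longrightarrow\;
\begin{cases}
  \text{const.} & \text{finite predictive information,}\\[4pt]
  \tfrac{K}{2}\,\log_2 T & \text{learning a model with $K$ parameters,}\\[4pt]
  A\,T^{\alpha},\quad 0<\alpha<1 & \text{nonparametric model classes (e.g., smooth functions).}
\end{cases}
\]

In this notation, the divergent part of $I_{\rm pred}(T)$ is whichever of the last two terms survives as $T \to \infty$; the claim of the abstract is that this divergent part is the natural measure of the complexity of the dynamics underlying the time series.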