Prediction with a Short Memory

We consider the problem of predicting the next observation given a sequence of past observations, and ask to what extent accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observations together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past and future observations is upper bounded by I, then a simple Markov model over the most recent I/ε observations obtains expected KL error ε, and hence ℓ_1 error √ε, with respect to the optimal predictor that has access to the entire past and knows the data-generating distribution. For a Hidden Markov Model with n hidden states, I is bounded by log n, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length-O(log n/ε) windows of observations achieves this error, provided the length of the sequence is d^{Ω(log n/ε)}, where d is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses. First, for HMMs with n hidden states, a window length of log n/ε is information-theoretically necessary to achieve expected KL error ε, or ℓ_1 error √ε. Second, the d^{Θ(log n/ε)} samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size d are necessary for any computationally tractable learning/prediction algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
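
To make the window-based predictor concrete, the sketch below (in Python) tabulates empirical frequencies of length-(ℓ+1) windows and predicts the next observation from the ℓ most recent ones. It is a minimal illustration under our own assumptions, not the paper's exact procedure: the function names, the add-one smoothing for unseen contexts, and the toy window length are illustrative choices.

```python
from collections import Counter, defaultdict


def fit_window_model(sequence, ell):
    """Tabulate empirical frequencies of length-(ell + 1) windows.

    Maps each context (a tuple of ell consecutive observations) to a Counter
    over the observation that followed it in the training sequence.
    """
    context_counts = defaultdict(Counter)
    for i in range(ell, len(sequence)):
        context = tuple(sequence[i - ell:i])
        context_counts[context][sequence[i]] += 1
    return context_counts


def predict_next(context_counts, recent, alphabet):
    """Predicted distribution over the next observation given the ell most
    recent observations, from empirical window frequencies.

    Add-one smoothing (an illustrative choice, not from the paper) keeps the
    prediction well defined for contexts or symbols never seen in training.
    """
    counts = context_counts.get(tuple(recent), Counter())
    total = sum(counts.values()) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}


# Toy usage: binary alphabet (d = 2) and a short window; in the paper's
# setting ell would be on the order of log(n)/eps.
seq = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
model = fit_window_model(seq, ell=3)
print(predict_next(model, recent=[1, 1, 0], alphabet=[0, 1]))
```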
