Optimal Sampling of Parametric Families: Implications for Machine Learning

It is well known in machine learning that a model trained on data drawn from one probability distribution typically performs far worse on test data drawn from a different distribution. In the extreme case, an entire continuum of distributions might plausibly have generated the observed test data; a desirable property of a learned model is then that it describes most distributions in that continuum equally well. This requirement naturally suggests sampling schemes over the continuum that yield optimally constructed training sets. We study the sequential prediction of Ornstein-Uhlenbeck processes that form a parametric family, and we find empirically that a simple deep network trained on training sets constructed optimally by the methods described in this letter can be robust to changes in the test-set distribution.
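
To make the setup concrete, the sketch below simulates an Ornstein-Uhlenbeck family and builds a training set by drawing one trajectory per parameter value sampled from the family. This is a minimal illustration under stated assumptions, not the letter's optimal construction: the log-uniform draw over the mean-reversion rate is a placeholder for the optimal sampling scheme, and the function names (simulate_ou, build_training_set) are hypothetical.

```python
import numpy as np

def simulate_ou(theta, mu, sigma, n_steps, dt, rng):
    """Exact discretization of the Ornstein-Uhlenbeck SDE
    dX_t = theta * (mu - X_t) dt + sigma dW_t."""
    x = np.empty(n_steps)
    x[0] = mu  # start at the long-run mean
    decay = np.exp(-theta * dt)
    # Conditional std of X_{t+dt} given X_t under the exact transition law
    noise_sd = sigma * np.sqrt((1.0 - decay**2) / (2.0 * theta))
    for t in range(1, n_steps):
        x[t] = mu + (x[t - 1] - mu) * decay + noise_sd * rng.standard_normal()
    return x

def build_training_set(n_series, n_steps, dt, theta_range, rng):
    """Draw one OU trajectory per sampled parameter value.
    The log-uniform draw over theta is a placeholder assumption,
    standing in for the paper's optimal sampling of the family."""
    thetas = np.exp(rng.uniform(np.log(theta_range[0]),
                                np.log(theta_range[1]), size=n_series))
    series = np.stack([simulate_ou(th, 0.0, 1.0, n_steps, dt, rng)
                       for th in thetas])
    return thetas, series

rng = np.random.default_rng(0)
thetas, train = build_training_set(n_series=128, n_steps=200, dt=0.1,
                                   theta_range=(0.1, 10.0), rng=rng)
# Sequential-prediction framing: predict x[t+1] from the prefix x[:t+1].
inputs, targets = train[:, :-1], train[:, 1:]
```

A network trained on `(inputs, targets)` pairs pooled across the sampled parameter values sees the whole family rather than a single process, which is the property the abstract attributes to optimally constructed training sets.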
