Bias-Variance Trade-offs: Novel Applications

Consider a given random variable F and a random variable that we can modify, F̂. We wish to use a sample of F̂ as an estimate of a sample of F. The mean squared error between such a pair of samples is a sum of four terms. The first term reflects the statistical coupling between F and F̂ and is conventionally ignored in bias-variance analysis. The second term reflects the inherent noise in F and is independent of the estimator F̂; accordingly, we cannot affect this term. In contrast, the third and fourth terms depend on F̂. The third term, called the bias, is independent of the precise samples of both F and F̂, and reflects the difference between the means of F and F̂. The fourth term, called the variance, is independent of the precise sample of F, and reflects the inherent noise in the estimator as one samples it. These last two terms can be modified by changing the choice of estimator. In particular, on small sample sets we can often decrease the mean squared error by, for instance, introducing a small bias that produces a large reduction in the variance. Although such bias-variance trade-offs are most commonly exploited in machine learning, this article shows that they apply in a much broader range of settings. We also show experimentally how existing bias-variance trade-offs can be applied in novel circumstances to improve the performance of a class of optimization algorithms.
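
For concreteness, the four-term decomposition described above can be written as the standard expansion of the expected squared error (the notation here is a sketch of the generic identity, not necessarily the paper's own):

\[
\mathbb{E}\big[(F - \hat{F})^2\big]
  \;=\; \underbrace{-2\,\mathrm{Cov}(F,\hat{F})}_{\text{coupling}}
  \;+\; \underbrace{\mathrm{Var}(F)}_{\text{noise}}
  \;+\; \underbrace{\big(\mathbb{E}[F]-\mathbb{E}[\hat{F}]\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathrm{Var}(\hat{F})}_{\text{variance}}
\]

The small-sample trade-off can likewise be illustrated with a minimal sketch (illustrative only, not the experiments reported in the article): deliberately shrinking an unbiased estimate introduces bias but can reduce variance enough to lower the overall mean squared error. The shrinkage factor, sample size, and noise level below are arbitrary illustrative choices.

# Minimal illustrative sketch (not the paper's experiments): with a small
# sample, a deliberately biased "shrinkage" estimate of a mean can beat the
# unbiased sample mean in mean squared error, because the bias it introduces
# buys a larger reduction in variance. All numerical choices are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
true_mean, noise_sd = 1.0, 3.0   # underlying mean and noise level
n, trials = 5, 100_000           # small sample size, many repeated trials
lam = 0.5                        # hypothetical shrinkage factor toward zero

sq_err_unbiased = np.empty(trials)
sq_err_shrunk = np.empty(trials)
for t in range(trials):
    x = rng.normal(true_mean, noise_sd, size=n)           # one small sample
    sq_err_unbiased[t] = (x.mean() - true_mean) ** 2      # unbiased estimator
    sq_err_shrunk[t] = (lam * x.mean() - true_mean) ** 2  # biased, lower variance

print("MSE of unbiased sample mean:   %.3f" % sq_err_unbiased.mean())  # ~1.8
print("MSE of shrunken (biased) mean: %.3f" % sq_err_shrunk.mean())    # ~0.7

Here the unbiased sample mean has MSE equal to its variance (noise_sd**2 / n = 1.8), while the shrunken estimate trades a squared bias of 0.25 for a variance of only 0.45, giving a lower total error of roughly 0.7.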
