Asymptotic Bayesian generalization error when training and test distributions are different

In supervised learning, we commonly assume that training and test data are sampled from the same distribution. However, this assumption is often violated in practice, and standard machine learning techniques then perform poorly. This paper focuses on revealing and improving the performance of Bayesian estimation when the training and test distributions differ. We formally analyze the asymptotic Bayesian generalization error and establish its upper bound under a very general setting. Our key finding is that lower-order terms, which can be ignored when the two distributions agree, play an important role under distribution change. We also propose a novel variant of stochastic complexity that can be used to choose an appropriate model and hyperparameters under a particular distribution change.
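To make the setting concrete, here is a minimal sketch, not the paper's method, of how a distribution change (covariate shift) inflates the Bayesian generalization error of a misspecified model. All specifics below (the sine target, Gaussian input distributions, conjugate Bayesian linear regression with known noise, the Monte Carlo evaluation) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # True regression function; the linear model below is misspecified for it.
    return np.sin(x)

sigma = 0.3   # known observation noise (assumption of this sketch)
n = 50        # training sample size

# Covariate shift: training inputs drawn from p_train(x) = N(0, 1).
x_train = rng.normal(0.0, 1.0, n)
y_train = f(x_train) + rng.normal(0.0, sigma, n)

# Conjugate Bayesian linear regression y = w0 + w1*x, prior w ~ N(0, alpha^-1 I).
alpha = 1.0
Phi = np.column_stack([np.ones(n), x_train])         # design matrix
S_inv = alpha * np.eye(2) + Phi.T @ Phi / sigma**2   # posterior precision
S = np.linalg.inv(S_inv)                             # posterior covariance
m = S @ Phi.T @ y_train / sigma**2                   # posterior mean

def neg_log_pred(x, y):
    """Negative log of the Bayesian predictive density at (x, y)."""
    phi = np.array([1.0, x])
    mu = phi @ m
    var = sigma**2 + phi @ S @ phi                   # predictive variance
    return 0.5 * np.log(2 * np.pi * var) + 0.5 * (y - mu) ** 2 / var

def gen_error(test_mean, n_mc=20000):
    """Monte Carlo estimate of E[-log p(y|x, data)] under test inputs N(test_mean, 1)."""
    x = rng.normal(test_mean, 1.0, n_mc)
    y = f(x) + rng.normal(0.0, sigma, n_mc)
    return np.mean([neg_log_pred(xi, yi) for xi, yi in zip(x, y)])

print("generalization error, test = training distribution:", gen_error(0.0))
print("generalization error, shifted test distribution:   ", gen_error(2.0))
```

Running this shows the predictive loss growing once the test input distribution moves away from the training one, which is the regime the paper's asymptotic analysis addresses.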
