Adaptive transfer learning

In transfer learning, we wish to make inference about a target population when we have access to data both from the target distribution itself and from a different but related source distribution. We introduce a flexible framework for transfer learning in the context of binary classification, allowing for covariate-dependent relationships between the source and target distributions that are not required to preserve the Bayes decision boundary. Our main contributions are to derive the minimax optimal rates of convergence (up to poly-logarithmic factors) in this problem, and to show that the optimal rate can be achieved by an algorithm that adapts to key aspects of the unknown transfer relationship, as well as to the smoothness and tail parameters of our distributional classes. This optimal rate turns out to have several regimes, depending on the interplay between the relative sample sizes and the strength of the transfer relationship, and our algorithm achieves optimality by a careful, decision tree-based calibration of local nearest-neighbour procedures.
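
The abstract does not spell out the procedure, so the following is only a loose, hypothetical sketch of the general idea it gestures at: partition the covariate space with a decision tree and then classify within each cell by a local nearest-neighbour rule fitted to the pooled source and target samples. It uses scikit-learn; the function name tree_calibrated_knn and its parameters (max_leaf_nodes, k) are invented for this illustration and do not correspond to the authors' algorithm or its adaptive choices.

```python
# Hypothetical illustration only (not the paper's algorithm): a decision-tree
# partition of the covariate space, with a local k-NN classifier fitted inside
# each cell on the pooled source and target samples. Inputs are numpy arrays.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def tree_calibrated_knn(X_source, y_source, X_target, y_target, X_test,
                        max_leaf_nodes=8, k=5):
    """Toy tree-partitioned local k-NN classifier for binary labels."""
    # Partition the covariate space using the (typically larger) source sample.
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    tree.fit(X_source, y_source)

    # Pool source and target data and record which cell each point falls in.
    X_pool = np.vstack([X_source, X_target])
    y_pool = np.concatenate([y_source, y_target])
    pool_leaves = tree.apply(X_pool)
    test_leaves = tree.apply(X_test)

    y_pred = np.empty(len(X_test), dtype=y_pool.dtype)
    for leaf in np.unique(test_leaves):
        in_cell = pool_leaves == leaf
        test_in_cell = test_leaves == leaf
        n_cell = int(in_cell.sum())
        if n_cell == 0:
            # Empty cell: fall back to the tree's own prediction.
            y_pred[test_in_cell] = tree.predict(X_test[test_in_cell])
            continue
        # Local nearest-neighbour rule within the cell.
        knn = KNeighborsClassifier(n_neighbors=min(k, n_cell))
        knn.fit(X_pool[in_cell], y_pool[in_cell])
        y_pred[test_in_cell] = knn.predict(X_test[test_in_cell])
    return y_pred
```

In the paper's setting the calibration (how deep to grow the partition, how many neighbours to use, and how to weight source against target observations in each cell) is chosen adaptively to the unknown transfer relationship; the fixed values above are placeholders for that choice.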
