Information-Theoretic Bounds on Transfer Generalization Gap Based on Jensen-Shannon Divergence

In transfer learning, the training and test data sets are drawn from different data distributions. The transfer generalization gap is the difference between the population loss on the target data distribution and the training loss; the training data set generally includes data drawn from both the source and target distributions. This work presents novel information-theoretic upper bounds on the average transfer generalization gap that capture (i) the domain shift between the target data distribution $P_{Z}^{\prime}$ and the source distribution $P_{Z}$ through a two-parameter family of generalized $(\alpha_{1},\alpha_{2})$-Jensen-Shannon (JS) divergences; and (ii) the sensitivity of the transfer learner output $W$ to each individual data sample $Z_{i}$ via the mutual information $I(W;Z_{i})$. For $\alpha_{1}\in(0,1)$, the $(\alpha_{1},\alpha_{2})$-JS divergence remains bounded even when the support of $P_{Z}$ is not included in that of $P_{Z}^{\prime}$. This contrasts with the Kullback-Leibler (KL) divergence $D_{\mathrm{KL}}(P_{Z}\Vert P_{Z}^{\prime})$-based bounds of Wu et al. [13], which are vacuous in this case. Moreover, the obtained bounds hold for unbounded loss functions with bounded cumulant generating functions, unlike the $\phi$-divergence-based bound of Wu et al. [13]. We also obtain new upper bounds on the average transfer excess risk in terms of the $(\alpha_{1},\alpha_{2})$-JS divergence for empirical weighted risk minimization (EWRM), which minimizes the weighted average of the training losses over the source and target data sets. Finally, we provide a numerical example that illustrates the merits of the introduced bounds.
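To make the support-mismatch property concrete, the minimal Python sketch below contrasts the KL divergence with a skewed JS-type divergence on two discrete distributions whose supports differ. The function `js_alpha` uses an assumed two-parameter form, a weighted sum of KL terms measured against the mixture $\alpha_{1}P_{Z}+(1-\alpha_{1})P_{Z}^{\prime}$, in the spirit of the skew JS divergences of [2], [6]; the paper's exact $(\alpha_{1},\alpha_{2})$-JS definition may differ, so treat this as an illustrative stand-in rather than the authors' construction.

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions (np.inf if p has mass where q does not)."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_alpha(p, p_prime, a1, a2):
    """Assumed (a1, a2)-skewed JS-type divergence (illustrative form, not necessarily the paper's):
    a2 * D_KL(p || m) + (1 - a2) * D_KL(p_prime || m), with mixture m = a1 * p + (1 - a1) * p_prime."""
    m = a1 * p + (1 - a1) * p_prime
    return a2 * kl(p, m) + (1 - a2) * kl(p_prime, m)

# Source P_Z has mass outside the support of target P'_Z, so D_KL(P_Z || P'_Z) is infinite
# (and a KL-based bound is vacuous), while the skewed JS-type divergence stays finite for a1 in (0, 1).
p  = np.array([0.5, 0.3, 0.2])   # source distribution P_Z
pp = np.array([0.6, 0.4, 0.0])   # target distribution P'_Z with strictly smaller support

print(kl(p, pp))                  # inf
print(js_alpha(p, pp, 0.5, 0.5))  # finite (approximately 0.075)
```

Because the mixture places weight at least $\alpha_{1}$ on $P_{Z}$ wherever $P_{Z}$ has mass, each KL term against the mixture is bounded (by $\log(1/\alpha_{1})$ and $\log(1/(1-\alpha_{1}))$, respectively), which is the property that keeps this family of divergences finite when $D_{\mathrm{KL}}(P_{Z}\Vert P_{Z}^{\prime})$ is infinite.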

[1] Emilio Soria Olivas et al., Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, 2009.

[2] Frank Nielsen et al., "On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid," Entropy, 2019.

[3] Koby Crammer et al., "A theory of learning from different domains," Machine Learning, 2010.

[4] Shaofeng Zou et al., "Tightening Mutual Information Based Bounds on Generalization Error," 2019 IEEE International Symposium on Information Theory (ISIT), 2019.

[5] Giuseppe Durisi et al., "Generalization Bounds via Information Density and Conditional Information Density," IEEE Journal on Selected Areas in Information Theory, 2020.

[6] Frank Nielsen et al., "A family of statistical symmetric divergences based on Jensen's inequality," arXiv, 2010.

[7] Yishay Mansour et al., "Domain Adaptation: Learning Bounds and Algorithms," COLT, 2009.

[8] Lei Zhang et al., "Generalization Bounds for Domain Adaptation," NIPS, 2012.

[9] Takuya Yamano et al., "Some bounds for skewed α-Jensen-Shannon divergence," Results in Applied Mathematics, 2019.

[10] Maxim Raginsky et al., "Information-theoretic analysis of generalization capability of learning algorithms," NIPS, 2017.

[11] Qi Chen et al., "Beyond H-Divergence: Domain Adaptation Theory With Jensen-Shannon Divergence," arXiv, 2020.

[12] F. Alajaji et al., Lecture Notes in Information Theory, 2000.

[13] Xuetong Wu, Jonathan H. Manton, Uwe Aickelin, and Jingge Zhu, "Information-theoretic analysis for transfer learning," 2020 IEEE International Symposium on Information Theory (ISIT), 2020.

[14] Koby Crammer et al., "Analysis of Representations for Domain Adaptation," NIPS, 2006.

[15] Gavriel Salomon et al., "Transfer of Learning," 1992.