Non-Negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation

Estimating the ratio of two probability densities has attracted attention because the density ratio is useful in various machine learning tasks, such as anomaly detection and domain adaptation. Methods collectively known as direct density ratio estimation (DRE) estimate the ratio by minimizing the Bregman (BR) divergence between a density ratio model and the true density ratio. However, existing direct DRE methods suffer from serious overfitting when used with flexible models such as neural networks. In this paper, we introduce a non-negative correction for the empirical risk that uses only prior knowledge of an upper bound on the density ratio. This correction makes a DRE method more robust against overfitting and enables the use of flexible models. In the theoretical analysis, we discuss the consistency of the empirical risk. In our experiments, the proposed estimators show favorable performance in inlier-based outlier detection and covariate shift adaptation.
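
The abstract does not state the corrected risk explicitly, so the following is only an illustrative sketch of how a non-negative correction could look for one member of the Bregman family, the least-squares risk. It assumes the model outputs are already evaluated on numerator samples (`r_nu`) and denominator samples (`r_de`), and that an upper bound on the true ratio (`ratio_upper_bound`) is known; these names and the choice of the least-squares case are assumptions for illustration, not taken from the paper. The plain empirical risk can be driven arbitrarily low by inflating the model's outputs on the numerator samples; the sketch splits the risk so that one term is non-negative in expectation and clips that term at zero.

```python
import numpy as np


def ls_risk(r_nu, r_de):
    """Plain empirical least-squares DRE risk (constant terms dropped).

    r_nu: model outputs on numerator samples (hypothetical name).
    r_de: model outputs on denominator samples (hypothetical name).
    """
    return 0.5 * np.mean(r_de ** 2) - np.mean(r_nu)


def nn_ls_risk(r_nu, r_de, ratio_upper_bound):
    """Sketch of a non-negatively corrected least-squares risk.

    Assumption: the true ratio p_nu / p_de is at most ratio_upper_bound,
    so with c = 1 / ratio_upper_bound the quantity p_de - c * p_nu is
    non-negative, and so is its integral against r^2 / 2.
    """
    c = 1.0 / ratio_upper_bound
    # Empirical estimate of a term that is non-negative in expectation.
    bracket = 0.5 * np.mean(r_de ** 2) - 0.5 * c * np.mean(r_nu ** 2)
    # Remaining part of the risk, computed on numerator samples only.
    rest = np.mean(-r_nu + 0.5 * c * r_nu ** 2)
    # Clip the bracket at zero; without clipping, bracket + rest == ls_risk.
    return max(bracket, 0.0) + rest


if __name__ == "__main__":
    # A model that inflates its outputs on numerator samples.
    r_nu = np.array([10.0, 10.0])
    r_de = np.array([0.1, 0.1])
    print(ls_risk(r_nu, r_de), nn_ls_risk(r_nu, r_de, ratio_upper_bound=2.0))
```

When the clipped term is non-negative, the corrected value coincides with the plain risk; when it goes negative, the corrected value is strictly larger, so inflating outputs on numerator samples no longer lowers the objective without limit.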
