Estimation and Applications of Quantiles in Deep Binary Classification

Quantile regression, based on the check loss, is a widely used inferential paradigm in Econometrics and Statistics. Conditional quantiles provide a robust alternative to classical conditional means and also allow uncertainty quantification of the predictions, while making very few distributional assumptions. We consider the analogue of the check loss in the binary classification setting. We assume that the conditional quantiles are smooth functions that can be learnt by Deep Neural Networks (DNNs). We then compute the Lipschitz constant of the proposed loss and show that its curvature is bounded under some regularity conditions. Consequently, recent results on error rates and DNN architecture complexity become directly applicable. We quantify the uncertainty of the class probabilities in terms of prediction intervals, and develop individualized confidence scores that can be used to decide whether a prediction is reliable at scoring time. By aggregating the confidence scores at the dataset level, we provide two additional metrics, model confidence and retention rate, to complement the widely used classifier summaries. We also study the robustness of the proposed non-parametric binary quantile classification framework, and demonstrate how to obtain several univariate summary statistics of the conditional distributions, in particular conditional means, from smoothed conditional quantiles, which allows explanation techniques such as Shapley values to be applied to the mean predictions. Finally, we demonstrate an efficient training regime for this loss based on Stochastic Gradient Descent with Lipschitz Adaptive Learning Rates (LALR).
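As a concrete illustration of the loss and the learning-rate rule referred to above, the Python sketch below implements the standard check (pinball) loss, a smoothed binary analogue in the spirit of smoothed binary quantile regression (Horowitz; Kordas), and an eta = 1/L Lipschitz-adaptive step size. The logistic smoother, the bandwidth h, and all function names are illustrative assumptions rather than the authors' exact formulation.

import numpy as np

def check_loss(u, tau):
    # Standard check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0}).
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))

def smoothed_binary_quantile_loss(scores, y, tau, h=0.1):
    # Hypothetical smoothed binary check loss. The tau-th conditional
    # quantile of a binary response is 1{f(x) >= 0}; the indicator is
    # replaced here by a logistic smoother with bandwidth h (in the spirit
    # of the smoothed maximum score / smoothed binary regression quantile
    # literature). The paper's exact smoother may differ.
    s = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float) / h))  # smooth 1{score >= 0}
    y = np.asarray(y, dtype=float)
    # Check loss at y minus the (smoothed) quantile prediction:
    # cost tau for missed positives, cost (1 - tau) for false positives.
    return np.mean(tau * y * (1.0 - s) + (1.0 - tau) * (1.0 - y) * s)

def lipschitz_adaptive_lr(lipschitz_constant):
    # Illustrative LALR-style rule: step size eta = 1 / L, where L bounds
    # the Lipschitz constant of the loss.
    return 1.0 / max(lipschitz_constant, 1e-12)

# Toy usage with synthetic scores and labels.
rng = np.random.default_rng(0)
scores = rng.normal(size=8)                   # real-valued network outputs f(x)
labels = (rng.random(8) < 0.5).astype(float)  # binary responses in {0, 1}
print(check_loss([-0.5, 0.5], tau=0.9))
print(smoothed_binary_quantile_loss(scores, labels, tau=0.5))
print(lipschitz_adaptive_lr(4.0))

Under these conventions, tau = 0.5 weights the two error types equally, while other values of tau trade them off asymmetrically, which is what makes the quantile level useful for uncertainty quantification in classification.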
