PAC Bayesian Performance Guarantees for Deep (Stochastic) Networks in Medical Imaging

The application of deep neural networks to medical imaging tasks has, in some sense, become commonplace. Still, a "thorn in the side" of the deep learning movement is the argument that deep networks are prone to overfitting and are thus unable to generalize well when datasets are small (as is common in medical imaging tasks). One way to bolster confidence is to provide mathematical guarantees, or bounds, on network performance after training that explicitly quantify the possibility of overfitting. In this work, we explore recent advances that use the PAC-Bayesian framework to provide bounds on the generalization error of large (stochastic) networks. While previous efforts focus on classification in larger natural image datasets (e.g., MNIST and CIFAR-10), we apply these techniques to both classification and segmentation in a smaller medical imaging dataset: the ISIC 2018 challenge set. We observe that the resulting bounds are competitive with a simpler baseline, while also being more explainable and alleviating the need for holdout sets.
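For context, a representative PAC-Bayes generalization bound of the kind used to certify stochastic networks is sketched below in the Langford–Seeger/Maurer form; the notation (prior P, posterior Q, sample size m, confidence 1 - \delta) is standard in this literature and is ours, not necessarily the exact statement used in the paper. With probability at least 1 - \delta over the draw of an i.i.d. training sample S of size m, simultaneously for all posteriors Q,

\[
\mathrm{kl}\!\left(\widehat{R}_S(Q) \,\middle\|\, R(Q)\right) \;\le\; \frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\frac{2\sqrt{m}}{\delta}}{m},
\]

where \(\widehat{R}_S(Q)\) and \(R(Q)\) are the empirical and true risks of the stochastic (Gibbs) classifier drawn from Q, \(\mathrm{KL}(Q\|P)\) is the divergence between posterior and prior, and \(\mathrm{kl}(q\|p)\) is the binary KL divergence. Numerically inverting the binary KL yields an explicit upper bound on the true risk that holds no matter how heavily Q was tuned on the training data, which is what allows a performance certificate without a separate holdout set.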
