Intelligence plays dice: Stochasticity is essential for machine learning

Many fields view stochasticity as a way to gain computational efficiency, often at the cost of accuracy. In this perspective article, we argue that stochasticity plays a fundamentally different role in machine learning (ML) and is likely a critical ingredient of intelligent systems. Reviewing the ML literature, we find that stochasticity features in many ML methods, affording them robustness, generalizability, and calibration. We also note that randomness appears to be prominent in biological intelligence, from the spiking patterns of individual neurons to the complex behavior of animals. We conclude by discussing how we believe stochasticity might shape the future of ML.
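
As one concrete illustration of such a stochastic mechanism, the minimal sketch below (ours, not drawn from the article; the toy network, its weights, and the dropout rate are arbitrary placeholders) applies random dropout masks during forward passes and then averages many stochastic passes, in the style of Monte Carlo dropout, so that the spread across passes serves as a rough measure of predictive uncertainty.

```python
# Minimal sketch (illustrative only): dropout as one example of stochasticity
# in ML. Randomly masking hidden units acts as a regularizer during training;
# averaging repeated stochastic forward passes at test time yields a simple
# uncertainty estimate ("Monte Carlo dropout" style).
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network with fixed random weights (placeholder values).
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, drop_prob=0.5, stochastic=True):
    """One forward pass; hidden units are randomly dropped when stochastic."""
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    if stochastic:
        mask = rng.random(h.shape) > drop_prob  # random dropout mask
        h = h * mask / (1.0 - drop_prob)        # inverted-dropout rescaling
    return h @ W2

x = rng.normal(size=(1, 4))

# Repeat the stochastic forward pass and summarize the predictions.
samples = np.stack([forward(x) for _ in range(100)])
print("predictive mean:", samples.mean())
print("predictive std :", samples.std())        # spread ~ model uncertainty
```

Setting stochastic=False collapses the same network to a single deterministic prediction with no accompanying spread, which is one way to see what the injected randomness buys.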
