Concrete Problems in AI Safety

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas and suggest research directions, with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.
