Simple linear classifiers via discrete optimization: learning certifiably optimal scoring systems for decision-making and risk assessment

Scoring systems are linear classification models that let users make quick predictions by adding, subtracting, and multiplying a few small numbers. These models are widely used in applications where humans have traditionally made decisions because they are easy to understand and validate. In spite of extensive deployment, many scoring systems are still built using ad hoc approaches that combine statistical techniques, heuristics, and expert judgement. Such approaches impose steep trade-offs with performance, making it difficult for practitioners to build scoring systems that will be used and accepted. In this dissertation, we present two new machine learning methods to learn scoring systems from data: Supersparse Linear Integer Models (SLIM) for decision-making applications, and Risk-calibrated Supersparse Linear Integer Models (RiskSLIM) for risk assessment applications. Both SLIM and RiskSLIM solve discrete optimization problems to learn scoring systems that are fully optimized for feature selection, small integer coefficients, and operational constraints. We formulate these problems as integer programming problems and develop specialized algorithms to recover certifiably optimal solutions with an integer programming solver. We illustrate the benefits of this approach by building scoring systems for real-world problems such as recidivism prediction, sleep apnea screening, ICU seizure prediction, and adult ADHD diagnosis. Our results show that a discrete optimization approach can learn simple models that perform well in comparison to the state-of-the-art, but that are far easier to customize, understand, and validate.

Thesis Supervisor: Cynthia Rudin
Title: Associate Professor of Computer Science, Duke University
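As a rough illustration of what learning a scoring system by discrete optimization can look like, the sketch below searches every small-integer coefficient vector on a tiny made-up dataset and keeps the one with the lowest penalized training error. Everything in it is an assumption made for illustration: the toy data, the function names (`predict`, `objective`, `fit_by_enumeration`), the coefficient range, and the objective (training error rate plus a per-feature penalty `c0`). Exhaustive enumeration stands in for the integer programming solver named in the abstract; it is not the SLIM or RiskSLIM formulation, and it is only feasible for a handful of features.

```python
from itertools import product

# Hypothetical toy data: three binary features per example, labels in {+1, -1}.
X = [
    (1, 1, 0),
    (1, 0, 1),
    (0, 1, 1),
    (1, 1, 1),
    (0, 0, 1),
    (0, 1, 0),
    (0, 0, 0),
    (1, 0, 0),
]
y = [1, 1, -1, 1, -1, -1, -1, 1]


def predict(intercept, coefs, x):
    """Add up integer points and predict +1 when the total score is positive."""
    score = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1 if score > 0 else -1


def objective(intercept, coefs, X, y, c0):
    """Training error rate plus a penalty of c0 per nonzero coefficient."""
    errors = sum(predict(intercept, coefs, x) != label for x, label in zip(X, y))
    return errors / len(X) + c0 * sum(c != 0 for c in coefs)


def fit_by_enumeration(X, y, coef_range=range(-2, 3), c0=0.01):
    """Check every integer model in the range, so the winner is optimal by exhaustion."""
    n_features = len(X[0])
    best = None
    for intercept in coef_range:
        for coefs in product(coef_range, repeat=n_features):
            value = objective(intercept, coefs, X, y, c0)
            if best is None or value < best[0]:
                best = (value, intercept, coefs)
    return best


value, intercept, coefs = fit_by_enumeration(X, y)
train_errors = sum(predict(intercept, coefs, x) != label for x, label in zip(X, y))
print("intercept:", intercept, "coefficients:", coefs, "training errors:", train_errors)
```

Because the search visits every candidate, the returned model is provably the best one within the chosen coefficient range for this toy objective. The search space grows exponentially with the number of features, which is one reason a practical method would hand the problem to an integer programming solver instead.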
