Learning Certifiably Optimal Rule Lists

We present the design and implementation of a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm produces rule lists that are optimal with respect to the regularized empirical risk on the training data, together with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve a speedup of several orders of magnitude and a massive reduction in memory consumption. We demonstrate that our approach produces optimal rule lists on practical problems in seconds. Our results indicate that it is possible to construct optimal sparse rule lists that are approximately as accurate as the proprietary COMPAS risk prediction tool on data from Broward County, Florida, yet completely interpretable. This framework is a novel alternative to CART and other decision tree methods for interpretable modeling.
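Concretely, for a rule list d with K rules, the objective is the regularized empirical risk R(d, x, y) = misc(d, x, y) + λK, where misc is the misclassification error on the training set and λ penalizes longer lists. The certificate of optimality comes from a branch-and-bound search over prefixes: the bound b(d_p) = ℓ_p(d_p, x, y) + λK, where ℓ_p counts only the errors on points captured by the prefix d_p, lower-bounds the risk of every rule list extending d_p, so any prefix whose bound meets or exceeds the best objective found so far can be discarded together with all of its extensions. The Python sketch below illustrates this idea under simplified assumptions (binary 0/1 labels, antecedents precomputed as boolean masks); the names `search` and `best_label_errors` are illustrative, and the sketch omits the symmetry-aware pruning, prefix tree, and bit-vector optimizations that make the real implementation fast.

```python
import numpy as np

def best_label_errors(mask, y):
    """Errors made on the points in `mask` when they are assigned their majority label."""
    captured = y[mask]
    if captured.size == 0:
        return 0
    ones = int(captured.sum())
    return min(ones, captured.size - ones)

def search(rules, y, lam=0.01, max_len=3):
    """Branch-and-bound over ordered prefixes of antecedents (illustrative sketch).

    rules: list of (name, mask) pairs; mask is a boolean array marking the
           training points that satisfy the antecedent.
    Returns (best regularized risk, names of the antecedents in the best prefix).
    """
    n = len(y)
    # Empty rule list: a single default rule predicting the majority class.
    best_obj = best_label_errors(np.ones(n, dtype=bool), y) / n
    best_prefix = ()
    # Each frame: (prefix, points captured so far, errors on captured points).
    stack = [((), np.zeros(n, dtype=bool), 0)]
    while stack:
        prefix, captured, errs = stack.pop()
        for name, mask in rules:
            if name in prefix:
                continue
            gained = mask & ~captured          # points this rule newly captures
            new_errs = errs + best_label_errors(gained, y)
            k = len(prefix) + 1
            bound = new_errs / n + lam * k     # lower bound for all extensions
            if bound >= best_obj:
                continue                       # prune the whole subtree
            default_errs = best_label_errors(~(captured | mask), y)
            obj = (new_errs + default_errs) / n + lam * k
            if obj < best_obj:
                best_obj, best_prefix = obj, prefix + (name,)
            if k < max_len:
                stack.append((prefix + (name,), captured | mask, new_errs))
    return best_obj, best_prefix
```

For example, with a binary label vector y and hypothetical antecedents such as rules = [("age<25", X[:, 0] < 25), ("priors>3", X[:, 1] > 3)], the returned prefix names the ordered antecedents of the best list found (up to max_len rules), each of which predicts the majority label of the points it captures, with a default rule for the rest.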
