Knockoffs for the mass: new feature importance statistics with false discovery guarantees

An important problem in machine learning and statistics is to identify features that causally affect the outcome. This is often impossible from purely observational data, and a natural relaxation is to identify features that are correlated with the outcome even when conditioned on all other observed features. For example, we want to establish that smoking is correlated with cancer even after conditioning on demographics. The knockoff procedure is a recent breakthrough in statistics that, in theory, can identify such conditionally correlated features while guaranteeing that the false discovery rate is controlled. The idea is to create synthetic features, called knockoffs, that capture the correlations among the original features. However, there are substantial computational and practical challenges to generating and using knockoffs. This paper makes several key advances that make the application of knockoffs more efficient and powerful. We develop an efficient algorithm to generate valid knockoffs from Bayesian networks. We then systematically evaluate knockoff test statistics and develop new statistics with improved power. The paper combines new mathematical guarantees with systematic experiments on real and synthetic data.
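
To make the generic knockoff recipe concrete (generate knockoffs, compute a feature-importance statistic that treats originals and knockoffs symmetrically, then threshold at the target FDR), here is a minimal Python sketch using the baseline "model-X" Gaussian construction of Candès, Fan, Janson and Lv with the standard lasso coefficient-difference statistic and the knockoff+ threshold. This is not the paper's Bayesian-network knockoff generator or its new statistics; the function names (gaussian_knockoffs, knockoff_threshold) and the simulated AR(1) design are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LassoCV

def gaussian_knockoffs(X, Sigma, rng):
    """Equi-correlated Gaussian model-X knockoffs for rows X ~ N(0, Sigma)."""
    p = Sigma.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    # Equi-correlated choice of s: diag(s) must satisfy diag(s) <= 2*Sigma (PSD sense).
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    s = min(2.0 * lam_min, np.min(np.diag(Sigma))) * np.ones(p)
    S = np.diag(s)
    mu = X - X @ Sigma_inv @ S        # conditional mean of knockoffs given X
    V = 2.0 * S - S @ Sigma_inv @ S   # conditional covariance (a Schur complement)
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))  # small jitter for stability
    return mu + rng.standard_normal(X.shape) @ L.T

def knockoff_threshold(W, q):
    """Knockoff+ threshold: smallest t with (1 + #{W_j <= -t}) / #{W_j >= t} <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf

rng = np.random.default_rng(0)
n, p, q = 500, 50, 0.1
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) design
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
beta[:10] = 1.5                                     # 10 true signals
y = X @ beta + rng.standard_normal(n)

Xk = gaussian_knockoffs(X, Sigma, rng)
coef = LassoCV(cv=5).fit(np.hstack([X, Xk]), y).coef_
W = np.abs(coef[:p]) - np.abs(coef[p:])             # lasso coefficient difference
selected = np.where(W >= knockoff_threshold(W, q))[0]
print("selected features:", selected)

The key design property is the symmetry of W: for a null feature, W_j is equally likely to be positive or negative, so the count of large negative statistics estimates the number of false positives among the large positive ones, which is what makes the threshold FDR-controlling.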
