Learning generative models for valid knockoffs using novel multivariate-rank based statistics

We consider the problem of generating valid knockoffs for knockoff filtering, a statistical method that provides provable false discovery rate guarantees for any model selection procedure. To this end, we are motivated by recent advances in multivariate distribution-free goodness-of-fit tests, namely the rank energy (RE), which is derived using theoretical results characterizing the optimal maps in Monge’s Optimal Transport (OT) problem. However, direct use of RE for learning generative models is not feasible because of its high computational and sample complexity, its saturation under large support discrepancy between distributions, and its non-differentiability in the generative parameters. To alleviate these issues, we begin by proposing a variant of the RE, dubbed soft rank energy (sRE), and its kernel variant, called soft rank maximum mean discrepancy (sRMMD), using entropic regularization of Monge’s OT problem. We then use sRMMD to generate deep knockoffs and show via extensive evaluation that it is a novel and effective method to produce valid knockoffs, achieving comparable, or in some cases improved, tradeoffs between detection power and false discoveries.
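
To make the idea of a soft rank energy concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that soft ranks are obtained by solving an entropically regularized OT problem between the pooled sample and reference points drawn uniformly on the unit cube, then taking the barycentric projection of the resulting plan, and that sRE is the energy distance between the soft ranks of the two samples. The function names (`soft_ranks`, `soft_rank_energy`), the squared Euclidean cost, the random uniform reference points, and the default values of `eps` and `n_iters` are all illustrative assumptions.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True)), axis=axis)


def sinkhorn_plan(C, eps=0.5, n_iters=200):
    # Log-domain Sinkhorn iterations for entropic OT between two
    # uniform empirical measures with cost matrix C (n x m).
    n, m = C.shape
    log_a, log_b = -np.log(n), -np.log(m)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iters):
        f = -eps * _logsumexp((g[None, :] - C) / eps + log_b, axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + log_a, axis=0)
    return np.exp((f[:, None] + g[None, :] - C) / eps + log_a + log_b)


def soft_ranks(Z, U, eps=0.5, n_iters=200):
    # Assumed definition: soft rank of each row of Z is the barycentric
    # projection of the entropic plan onto the reference points U.
    C = ((Z[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)
    P = sinkhorn_plan(C, eps, n_iters)
    return (P @ U) / P.sum(axis=1, keepdims=True)


def energy_distance(A, B):
    # V-statistic form of the energy distance between two samples.
    def mean_dist(S, T):
        return np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1).mean()
    return 2.0 * mean_dist(A, B) - mean_dist(A, A) - mean_dist(B, B)


def soft_rank_energy(X, Y, eps=0.5, n_iters=200, rng=None):
    # Pool both samples, rank them jointly against uniform reference
    # points on [0, 1]^d, then compare the two groups of soft ranks.
    rng = np.random.default_rng() if rng is None else rng
    Z = np.vstack([X, Y])
    U = rng.random(Z.shape)  # assumption: i.i.d. uniform reference points
    R = soft_ranks(Z, U, eps, n_iters)
    return energy_distance(R[: len(X)], R[len(X):])
```

Under these assumptions, calling `soft_rank_energy(X, Y)` on two arrays of shape `(n, d)` and `(m, d)` returns a scalar that should be close to zero when both samples come from the same distribution and larger when they differ. Because every step is built from smooth operations, the same computation written in an automatic-differentiation framework would be differentiable in the samples, which is the property the abstract identifies as missing from RE.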
