Stabilizing variable selection and regression

We consider regression in which one predicts a response $Y$ from a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes the distributional changes across environments into account. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression, which explicitly enforces stability and thus improves generalization to previously unseen environments. Our work is motivated by an application in systems biology: using multi-omic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of argument for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to characterize graphically which functional dependences on the response are stable and which are not. Formally, we introduce the notion of a stable blanket, a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error among all regressions that generalize to unseen environments.
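To make the stability idea concrete, the following Python snippet is a minimal sketch of the screening procedure described above, not the authors' implementation or accompanying software. It treats a predictor subset $S$ as stable when, roughly, the regression of $Y$ on $X_S$ behaves the same way in every environment; here a simple one-way ANOVA on the residuals across environments stands in for a proper stability score, and a pooled F-test stands in for a predictiveness score. All function names, thresholds, and the particular tests are illustrative assumptions.

```python
# Illustrative sketch of stability screening across environments (our own
# simplification): fit pooled OLS on each small predictor subset, keep the
# subset only if its residuals look invariant across environments and the
# fit is predictive, then average the accepted fits for prediction.

from itertools import combinations

import numpy as np
from scipy import stats


def screen_stable_subsets(X, y, env, max_size=2, alpha_stab=0.05, alpha_pred=0.05):
    """Return (subset, coefficients) pairs whose pooled OLS fit passes a crude
    stability check (equal residual means across environments) and a crude
    predictiveness check (overall regression F-test)."""
    n, d = X.shape
    accepted = []
    for size in range(1, max_size + 1):
        for S in combinations(range(d), size):
            Xs = np.column_stack([np.ones(n), X[:, S]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            resid = y - Xs @ beta
            # stability proxy: residual means should not differ across environments
            groups = [resid[env == e] for e in np.unique(env)]
            p_stab = stats.f_oneway(*groups).pvalue
            # predictiveness proxy: overall F-test of the pooled regression
            rss = np.sum(resid ** 2)
            tss = np.sum((y - y.mean()) ** 2)
            f_stat = ((tss - rss) / size) / (rss / (n - size - 1))
            p_pred = stats.f.sf(f_stat, size, n - size - 1)
            if p_stab > alpha_stab and p_pred < alpha_pred:
                accepted.append((S, beta))
    return accepted


def predict_averaged(accepted, X_new):
    """Average the predictions of all accepted subset regressions."""
    n_new = X_new.shape[0]
    preds = [np.column_stack([np.ones(n_new), X_new[:, S]]) @ beta
             for S, beta in accepted]
    return np.mean(preds, axis=0) if preds else np.zeros(n_new)
```

In this sketch, a subset that passes both screens plays the role of a stable and predictive set, and averaging over all accepted subsets mirrors the ensemble flavour of the estimator; the paper's actual construction relies on different stability and prediction scores and on a weighted combination of the retained regressions.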
