Learning High-Order Interactions via Targeted Pattern Search

Logistic Regression (LR) is a widely used statistical method in empirical binary classification studies. However, real-life scenarios often present complexities that prevent the use of the LR model as-is, and instead call for the inclusion of high-order interactions to capture data variability. This becomes even more challenging because: (i) datasets are growing wider, with more and more variables; (ii) studies are typically conducted in strongly imbalanced settings; (iii) sample sizes range from very large to extremely small; (iv) both predictive models and interpretable results are needed. In this paper we present a novel algorithm, Learning high-order Interactions via targeted Pattern Search (LIPS), to select interaction terms of varying order to include in an LR model for an imbalanced binary classification task when input data are categorical. The rationale behind LIPS stems from the duality between item sets and categorical interactions. The algorithm relies on an interaction learning step based on a well-known frequent item set mining algorithm, and a novel dissimilarity-based interaction selection step that allows the user to specify the number of interactions to be included in the LR model. In addition, we describe two variants (Scores LIPS and Clusters LIPS) that can address even more specific needs. Through a set of experiments we validate our algorithm and demonstrate its wide applicability to real-life research scenarios, showing that it outperforms a benchmark state-of-the-art algorithm.
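
The duality between item sets and categorical interactions mentioned above can be illustrated with a minimal sketch. It is not the LIPS implementation: the abstract does not name the mining algorithm, so a brute-force enumeration stands in for it, and the function names, toy records, and thresholds are all hypothetical. Each record is treated as a set of (variable, level) items; an item set that occurs often enough corresponds to an interaction term, namely the product of the indicator variables of its items.

```python
from itertools import combinations

def to_items(row):
    """Turn one categorical record into a set of (variable, level) items."""
    return frozenset(row.items())

def frequent_itemsets(rows, min_support, max_order):
    """Brute-force stand-in for a frequent item set miner (illustrative only).

    Returns a dict mapping each item set (a tuple of items) to its support,
    i.e. the number of records that contain every item in the set.
    """
    transactions = [to_items(r) for r in rows]
    items = sorted(set().union(*transactions))
    result = {}
    for order in range(1, max_order + 1):
        for candidate in combinations(items, order):
            # Two levels of the same variable can never co-occur in a record.
            if len({var for var, _ in candidate}) < order:
                continue
            support = sum(1 for t in transactions if frozenset(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result

def interaction_column(rows, itemset):
    """Binary design-matrix column for the interaction encoded by an item set."""
    return [int(all(r.get(var) == level for var, level in itemset)) for r in rows]

# Hypothetical toy data: four records over three categorical variables.
rows = [
    {"smoker": "yes", "sex": "F", "age": "old"},
    {"smoker": "yes", "sex": "F", "age": "young"},
    {"smoker": "no", "sex": "M", "age": "old"},
    {"smoker": "yes", "sex": "M", "age": "old"},
]

mined = frequent_itemsets(rows, min_support=2, max_order=2)
pair = (("sex", "F"), ("smoker", "yes"))
column = interaction_column(rows, pair)
```

Here `column` is the candidate second-order interaction "sex = F and smoker = yes" written as a 0/1 feature that could be appended to the LR design matrix. How LIPS ranks and selects among such candidates (the dissimilarity-based step) is not reproduced in this sketch.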
