Logistic regression in large rare events and imbalanced data: A performance comparison of prior correction and weighting methods

The purpose of this study is to use the truncated Newton method in prior correction logistic regression (LR). A regularization term is added to prior correction LR to improve its performance, which results in the truncated‐regularized prior correction algorithm. The performance of this algorithm is compared with that of weighted LR and regular LR for large imbalanced binary class data sets. The results, based on the KDD99 intrusion detection data set and six other data sets, indicate that prior correction and weighted LR have the same computational efficiency when the truncated Newton method is used in both. Weighting, however, yielded higher discriminative performance than both prior correction and regular LR on nearly all the data sets. From this study, we conclude that weighting outperforms both the regular and prior correction LR models on most data sets and is the method of choice when LR is used to evaluate imbalanced and rare event data.
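
The abstract names the two corrections without stating their formulas. As a minimal sketch, assuming the standard formulation from King and Zeng's "Logistic Regression in Rare Events Data" (which the paper may or may not follow exactly), prior correction shifts only the fitted intercept, while weighting rescales each observation's contribution to the log-likelihood. The function names and the symbols `tau` (population event fraction) and `ybar` (sample event fraction) are illustrative assumptions, not taken from the paper, and the regularization term and truncated Newton solver are deliberately omitted.

```python
import math


def prior_corrected_intercept(b0_hat, tau, ybar):
    """Shift a fitted intercept so predictions reflect the population rate.

    Assumed formula: b0 = b0_hat - ln[((1 - tau) / tau) * (ybar / (1 - ybar))].
    tau  : assumed fraction of events in the population
    ybar : fraction of events in the (possibly resampled) training data
    """
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return b0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


def class_weights(tau, ybar):
    """Return (w1, w0): assumed weights for events and non-events."""
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return tau / ybar, (1.0 - tau) / (1.0 - ybar)


def weighted_log_likelihood(y, p, w1, w0):
    """Weighted Bernoulli log-likelihood for labels y and probabilities p."""
    return sum(
        w1 * yi * math.log(pi) + w0 * (1 - yi) * math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )
```

Under these assumptions, when the sample is already representative (`tau == ybar`) the intercept shift is zero and both weights equal 1, so both methods reduce to regular LR.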
