Efficient big data model selection with applications to fraud detection

Abstract: As the volume and complexity of data continue to grow, more attention is being focused on solving so-called big data problems. One field where this focus is pertinent is credit card fraud detection, where model selection approaches can identify key predictors for preventing fraud. Stagewise selection is a classic model selection technique that has seen renewed interest due to its computational simplicity and flexibility. Over a sequence of simple learning steps, stagewise techniques build a sequence of candidate models in a less greedy manner than the stepwise approach. This paper introduces a new stochastic stagewise technique that integrates a sub-sampling approach into the stagewise framework, yielding a simple tool for model selection when working with big data. Simulation studies demonstrate that the proposed technique offers a reasonable trade-off between computational cost and predictive performance. We apply the proposed approach to synthetic credit card fraud data to demonstrate the technique's application.
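
To make the idea of combining sub-sampling with stagewise updates concrete, the following is a minimal sketch in Python. It is not the algorithm proposed in the paper, whose details are not given in this abstract. It assumes a plain least-squares loss, a fixed step size, and a fresh uniform random subsample at every step. The function names `stochastic_stagewise` and `make_toy_data`, and all parameter names and defaults, are hypothetical and chosen only for illustration.

```python
import random


def stochastic_stagewise(X, y, step=0.01, n_steps=800, subsample=0.2, seed=0):
    """Forward stagewise least squares where each step sees a random subsample.

    X is a list of rows (lists of floats) and y is a list of floats.
    Returns the coefficient path: one coefficient vector per step,
    starting from the all-zero model. Illustrative sketch only.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(1, int(subsample * n))  # subsample size used at every step
    beta = [0.0] * p
    path = [beta[:]]
    for _ in range(n_steps):
        rows = rng.sample(range(n), m)
        # Residuals on the subsample under the current coefficients.
        resid = {i: y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in rows}
        # Average inner product of each predictor with the residuals.
        score = [sum(X[i][j] * resid[i] for i in rows) / m for j in range(p)]
        # Nudge only the predictor most correlated with the residuals.
        j_best = max(range(p), key=lambda j: abs(score[j]))
        beta[j_best] += step if score[j_best] > 0 else -step
        path.append(beta[:])
    return path


def make_toy_data(n=500, seed=1):
    """Hypothetical toy data: 5 predictors, only the 1st and 3rd are active."""
    rng = random.Random(seed)
    true_beta = [2.0, 0.0, -1.0, 0.0, 0.0]
    X = [[rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.1)
         for row in X]
    return X, y


X, y = make_toy_data()
path = stochastic_stagewise(X, y)
print("final coefficients:", [round(b, 2) for b in path[-1]])
```

Each entry of the returned path is a candidate model, so model selection amounts to choosing a point along the path, for example by validation error. Because every step touches only a fraction of the rows, the per-step cost scales with the subsample size rather than the full data size, which is the kind of computational saving the abstract describes.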
