Paper information - Efficient big data model selection with applications to fraud detection

Abstract: As the volume and complexity of data continue to grow, more attention is being focused on solving so-called big data problems. One field where this focus is pertinent is credit card fraud detection. Model selection approaches can identify key predictors for preventing fraud. Stagewise selection is a classic model selection technique that has experienced revitalized interest due to its computational simplicity and flexibility. Over a sequence of simple learning steps, stagewise techniques build a sequence of candidate models in a manner that is less greedy than the stepwise approach. This paper introduces a new stochastic stagewise technique that integrates a sub-sampling approach into the stagewise framework, yielding a simple tool for model selection when working with big data. Simulation studies demonstrate that the proposed technique offers a reasonable trade-off between computational cost and predictive performance. We apply the proposed approach to synthetic credit card fraud data to demonstrate the technique’s application.
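To make the idea of combining sub-sampling with stagewise steps concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a least-squares loss, a fixed step size, and uniform sub-sampling without replacement; the function name `stochastic_stagewise` and all parameter names and defaults are hypothetical choices made for illustration.

```python
import numpy as np


def stochastic_stagewise(X, y, n_steps=500, step_size=0.01,
                         subsample_size=200, seed=0):
    """Forward stagewise least squares using a fresh random sub-sample per step.

    Illustrative assumptions (not taken from the paper): squared-error loss,
    a fixed step size, and uniform sub-sampling without replacement.
    Returns an array whose row k holds the coefficients after k steps.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = min(subsample_size, n)
    beta = np.zeros(p)
    path = np.zeros((n_steps + 1, p))
    for k in range(n_steps):
        # Draw a sub-sample of rows so each step touches only m observations.
        idx = rng.choice(n, size=m, replace=False)
        resid = y[idx] - X[idx] @ beta
        # Gradient of 0.5 * mean squared error on the sub-sample.
        grad = -X[idx].T @ resid / m
        # Stagewise step: nudge only the coordinate with the largest gradient.
        j = np.argmax(np.abs(grad))
        beta[j] -= step_size * np.sign(grad[j])
        path[k + 1] = beta
    return path


# Usage on simulated data: only the first three predictors matter.
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((5000, 20))
true_beta = np.zeros(20)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + 0.5 * data_rng.standard_normal(5000)

path = stochastic_stagewise(X, y)
final_beta = path[-1]
```

Each row of `path` is one candidate model, so a model selection criterion of the reader's choice (for example, held-out prediction error) could be evaluated along the path. The per-step cost depends on `subsample_size` rather than the full sample size, which is the trade-off between computation and accuracy that the abstract refers to.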

Gregory Vaughan
