Data Sampling Approaches with Severely Imbalanced Big Data for Medicare Fraud Detection

Class imbalance is an important problem in machine learning. With increases in available information and the growing use of Big Data sources to extract meaning from data, the challenges associated with class imbalance continue to influence research and shape business value. In this paper, we focus on using highly imbalanced Big Data from Medicare to detect provider claims fraud. We combine three Medicare parts and generate fraud labels using real-world excluded providers. The number of known fraudulent providers is very small, with 0.062% of the combined dataset being labeled as fraud, indicating severe class imbalance. To address class imbalance concerns, we provide experimental results incorporating six different data sampling methods (undersampling and oversampling) to create datasets for five class ratios (imbalanced to balanced), as well as using the full dataset (with no sampling). Three state-of-the-art machine learning models with Apache Spark are used to assess Medicare fraud detection performance across data sampling methods and class ratios. We demonstrate that data sampling, in particular random undersampling, presents good results across all learners, whereas oversampling provides no benefit versus models built using the full dataset.
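To make the idea of sampling to a class ratio concrete, the following is a minimal sketch of random undersampling in plain Python. It is not the authors' Spark implementation. The function name `random_undersample`, the `minority_fraction` parameter, and the toy label counts are assumptions made for illustration, since the abstract does not list the specific ratios used. The sketch keeps every minority (fraud) row and randomly discards majority rows until the minority class makes up roughly the requested fraction.

```python
import random


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted row indices after random undersampling.

    Hypothetical helper: keeps all rows labeled 1 (fraud) and a random
    subset of rows labeled 0 (non-fraud) so that the minority class makes
    up approximately `minority_fraction` of the result.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")

    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]

    # Majority rows needed so that minority / (minority + majority) hits the target.
    target_majority = round(len(minority) * (1 - minority_fraction) / minority_fraction)
    # Never request more majority rows than exist (i.e., no oversampling here).
    target_majority = min(target_majority, len(majority))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    kept_majority = rng.sample(majority, target_majority)
    return sorted(minority + kept_majority)


# Toy data (illustrative only): 10 fraud rows among 1,000 providers.
toy_labels = [1] * 10 + [0] * 990
balanced_idx = random_undersample(toy_labels, minority_fraction=0.5)
imbalanced_idx = random_undersample(toy_labels, minority_fraction=0.1)
```

Here `balanced_idx` selects a 50:50 dataset and `imbalanced_idx` a 10:90 one. Oversampling would instead leave the majority class intact and add duplicated or synthetic minority rows.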
