Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis

A considerable amount of health record (HR) data has been stored due to recent advances in the digitalization of medical systems. However, HR data are not always easy to analyze, particularly when the number of persons with a target disease is very small relative to the population. This situation is called the imbalanced data problem. Over-sampling and under-sampling are two approaches for redressing an imbalance between minority and majority examples, and they can be combined into ensemble algorithms. However, these approaches do not function when the absolute number of minority examples is small, a situation called the extremely imbalanced and small minority (EISM) data problem. The present work proposes a new algorithm, boosting combined with heuristic under-sampling and distribution-based sampling (HUSDOS-Boost), to solve the EISM data problem. To construct an artificially balanced dataset from the original imbalanced dataset, HUSDOS-Boost uses under-sampling to eliminate redundant majority examples based on prior boosting results and over-sampling to generate artificial minority examples that follow the minority class distribution. The performance and characteristics of HUSDOS-Boost were evaluated through application to eight imbalanced datasets. In addition, the algorithm was applied to original clinical HR data to detect patients with stomach cancer. The results showed that HUSDOS-Boost outperformed current imbalanced data handling methods, particularly when the data are EISM. Thus, the proposed HUSDOS-Boost is a useful methodology for HR data analysis.
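As a rough illustration of the resampling idea described above, and not the HUSDOS-Boost algorithm itself, the following sketch builds a balanced dataset from an imbalanced one. It rests on several assumptions that do not come from the paper: the function name `balance_by_resampling` is hypothetical, majority examples are dropped uniformly at random rather than heuristically from prior boosting results, the minority class distribution is modelled as a single Gaussian, and the boosting loop is omitted.

```python
import numpy as np


def balance_by_resampling(X, y, n_per_class, rng):
    """Return a dataset with n_per_class examples of each class.

    Assumptions (illustrative, not from the paper): labels are binary
    with 1 = minority and 0 = majority, there are at least two minority
    examples, and n_per_class does not exceed the majority count.
    """
    X_min, X_maj = X[y == 1], X[y == 0]

    # Under-sampling: keep a uniformly random subset of majority examples.
    # This is a stand-in for a heuristic selection step.
    keep = rng.choice(len(X_maj), size=n_per_class, replace=False)
    X_maj_sub = X_maj[keep]

    # Over-sampling: fit a Gaussian to the minority class and draw
    # artificial minority examples from it.
    X_min_kept = X_min[:n_per_class]
    n_new = n_per_class - len(X_min_kept)
    if n_new > 0:
        mean = X_min.mean(axis=0)
        # A small ridge keeps the covariance usable when the minority
        # class has fewer examples than features.
        cov = np.atleast_2d(np.cov(X_min, rowvar=False))
        cov = cov + 1e-6 * np.eye(X.shape[1])
        X_new = rng.multivariate_normal(mean, cov, size=n_new)
        X_min_all = np.vstack([X_min_kept, X_new])
    else:
        X_min_all = X_min_kept

    X_bal = np.vstack([X_maj_sub, X_min_all])
    y_bal = np.concatenate([
        np.zeros(n_per_class, dtype=int),
        np.ones(n_per_class, dtype=int),
    ])
    return X_bal, y_bal
```

In an ensemble setting, a function like this would be called once per boosting round so that each weak learner is trained on a different balanced sample. With very few minority examples the fitted covariance is poorly estimated, which is one reason a single Gaussian is only a placeholder here.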
