Data Augmentation for Heart Arrhythmia Classification

In this paper, we introduce a data augmentation technique and apply it to an ECG dataset from the UK Biobank for heart arrhythmia classification with the XGBoost algorithm. In most clinical datasets, participants with a disease (positive samples) are far fewer than healthy participants (negative samples). As a result, a machine learning algorithm sees too few positive cases to train a reliable model. We overcome this limitation by up-sampling the positive cases. To validate the technique, we assess its reliability by comparing the augmented dataset with the original data distribution using the Wilcoxon signed-rank test. We also compare XGBoost classification results with and without data augmentation, using the AUC (area under the ROC curve) and Cohen's kappa as evaluation metrics. The AUC improved from 0.58 without augmentation to 0.83 with augmentation, and Cohen's kappa improved from 0 to 0.76, a value conventionally interpreted as substantial agreement. The technique can be applied to other datasets and is not limited to clinical studies.
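To make the two evaluation metrics concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names `cohen_kappa` and `roc_auc` and the toy labels are assumptions introduced for illustration. It also shows one way a kappa of 0 can arise on imbalanced data: a classifier that labels every participant as negative reaches high accuracy yet agrees with the true labels no better than chance.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one label occurs at all
    return (observed - expected) / (1.0 - expected)


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced labels: 95 healthy (0) and 5 diseased (1) participants.
y_true = [0] * 95 + [1] * 5
all_negative = [0] * 100  # a model that never predicts the positive class

accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
print("accuracy:", accuracy)
print("kappa:", cohen_kappa(y_true, all_negative))
print("auc:", roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a toy example; a rank-based computation or a library routine would be preferable on a full dataset.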
