Synthetic minority oversampling of vital statistics data with generative adversarial networks

Abstract

Objective: Minority oversampling is a standard approach for adjusting the class ratio in imbalanced data. However, established methods often yield only modest improvements in classification performance when applied to data with an extremely imbalanced class distribution and to mixed-type data. This situation is typical of vital statistics data, in which the outcome incidence dictates the number of positive observations. In this article, we develop a novel neural network-based oversampling method called actGAN (activation-specific generative adversarial network) that can generate synthetic observations which improve prediction performance in this context.

Materials and Methods: From vital statistics data, early stillbirth was chosen as the outcome to be predicted from demographics, pregnancy history, and infections. The data contained 363 560 live births and 139 early stillbirths, giving a class distribution of 99.96% and 0.04%, respectively. The hyperparameters of actGAN and of a baseline method, SMOTE-NC (Synthetic Minority Over-sampling Technique-Nominal Continuous), were tuned with Bayesian optimization, and both were compared against a cost-sensitive learning-only approach.

Results: While SMOTE-NC provided mixed results, actGAN consistently improved both the true positive rate at a clinically significant false positive rate and the area under the receiver operating characteristic curve.

Discussion: Including an activation-specific output layer in the generator network of actGAN enables the addition of information about the underlying data structure, which outperforms the nominal mechanism of SMOTE-NC.

Conclusions: actGAN improves prediction performance for our learning task. The method could be applied to other mixed-type data prediction tasks that are known to be afflicted by class imbalance and limited data availability.
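As a rough illustration of two quantities named in the abstract, the sketch below recomputes the class shares from the reported counts and shows one way a true positive rate at a fixed false positive rate could be read off a set of classifier scores. This is not the authors' evaluation code: the function names `class_shares` and `tpr_at_fpr`, the toy labels and scores, and the 10% false positive rate are all assumptions made for the example.

```python
# Illustrative sketch only; names and toy data are hypothetical.

def class_shares(n_negative, n_positive):
    """Return (negative share, positive share) in percent."""
    total = n_negative + n_positive
    return 100.0 * n_negative / total, 100.0 * n_positive / total


def tpr_at_fpr(y_true, scores, max_fpr):
    """Highest true positive rate reachable while keeping FPR <= max_fpr.

    A sample is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = 0.0
    # Try every distinct score as a threshold, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Counts reported in the abstract: live births vs early stillbirths.
neg_share, pos_share = class_shares(363_560, 139)
print(f"{neg_share:.2f}% vs {pos_share:.2f}%")  # 99.96% vs 0.04%

# Toy labels and scores, made up for illustration (10 negatives, 2 positives).
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
s = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.25, 0.35, 0.6, 0.12, 0.7, 0.5]
print(tpr_at_fpr(y, s, max_fpr=0.10))  # 1.0
```

Fixing the false positive rate first and then reporting the true positive rate keeps the comparison between oversampling methods at a single operating point, which is more informative than raw accuracy when positives make up only about 0.04% of the data.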
