Using Undersampling with Ensemble Learning to Identify Factors Contributing to Preterm Birth

In this paper, we propose ensemble learning models to identify factors contributing to preterm birth. Our work leverages a rich dataset collected by an NIEHS P42 Center that is working to identify the dominant factors responsible for the high rate of premature births in northern Puerto Rico. We investigate analytical models that address two major challenges in this dataset: 1) a significant amount of incomplete data, and 2) class imbalance. First, to increase the completeness of the dataset, we apply and compare two types of missing-data imputation methods: 1) mean-based and 2) similarity-based. Second, to address the class imbalance, we propose a feature selection and evaluation model that combines undersampling with ensemble learning. We apply and compare multiple ensemble feature selection methods, including Complete Linear Aggregation (CLA), Weighted Mean Aggregation (WMA), Feature Occurrence Frequency (OFA), and Classification Accuracy Based Aggregation (CAA). To further account for the missing data present in each feature, we propose two novel methods: 1) Missing Data Rate and Accuracy Based Aggregation (MAA), and 2) Entropy and Accuracy Based Aggregation (EAA). Both proposed methods balance the degree of data variance introduced by missing-data handling during feature selection while maintaining model performance. Our results show a 42% improvement in sensitivity versus fallout over previous state-of-the-art methods.
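The general idea of combining mean-based imputation, undersampling, and ensemble feature ranking can be sketched as follows. This is a minimal illustration under stated assumptions, not the aggregation rules evaluated in the paper: the abstract does not define CLA, WMA, OFA, CAA, MAA, or EAA, so the sketch simply averages a toy per-feature score across balanced subsets. All function names, the synthetic data, and the scoring rule (absolute difference of class means, standing in for a trained classifier's feature importance) are hypothetical.

```python
import random
from statistics import mean


def impute_mean(rows):
    """Mean-based imputation: replace None with the column's observed mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def undersample(labels, rng):
    """Indices of a balanced subset: every minority row plus an
    equal-size random draw (without replacement) from the majority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def score_features(rows, labels, idx):
    """Toy score per feature (an assumption): absolute difference
    between the two class means, computed on the subset idx."""
    scores = []
    for j in range(len(rows[0])):
        a = [rows[i][j] for i in idx if labels[i] == 1]
        b = [rows[i][j] for i in idx if labels[i] == 0]
        scores.append(abs(mean(a) - mean(b)))
    return scores


def ensemble_rank(rows, labels, n_subsets=10, seed=0):
    """Average the per-subset scores and rank features by the result."""
    rng = random.Random(seed)
    filled = impute_mean(rows)
    totals = [0.0] * len(filled[0])
    for _ in range(n_subsets):
        idx = undersample(labels, rng)
        for j, s in enumerate(score_features(filled, labels, idx)):
            totals[j] += s / n_subsets
    ranking = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return ranking, totals


# Synthetic, imbalanced data: 10 positive rows versus 50 negative rows.
# Feature 0 is informative and has missing values; feature 1 is noise.
data_rng = random.Random(1)
rows, labels = [], []
for i in range(60):
    y = 1 if i < 10 else 0
    informative = 5.0 + data_rng.random() if y == 1 else data_rng.random()
    rows.append([informative if i % 7 else None, data_rng.random()])
    labels.append(y)

ranking, scores = ensemble_rank(rows, labels)
```

Here `ranking` lists feature indices from most to least discriminative. A weighted variant would multiply each subset's scores by a per-subset quality measure, such as validation accuracy, before averaging.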
