Feature selection and prediction of small-for-gestational-age infants

Being small for gestational age (SGA) often causes serious problems, so identifying the risk factors for SGA is important. Traditional statistical methods such as stepwise logistic regression (LR) have been widely used to discover possible risk factors, but feature selection methods from the machine learning field have rarely been applied to the task. This paper conducts, for the first time, a comparison of five feature selection methods from both fields for SGA risk factor analysis. To evaluate their performance, four classification algorithms are used to construct SGA prediction models, with precision and the area under the receiver operating characteristic curve (AUC) as the evaluation criteria. Stepwise LR achieves the best performance among the five feature selection methods because it conducts both a univariate significance test and a model significance test, which makes it better suited to handling the complex relations among features. The top 20 features selected by each method, and the 27 features selected by four or five of them, could help physicians revise traditional SGA evaluation models. An ensemble method is also used to build SGA prediction models from the feature subsets, and the results show that it outperforms the individual models.
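The two evaluation criteria and the ensemble idea can be illustrated with a minimal, standard-library Python sketch. Everything in it is an assumption made for illustration: the labels and scores are invented toy numbers (not data or results from the study), the 0.5 decision threshold is arbitrary, and averaging predicted probabilities ("soft voting") is only one common way to build an ensemble, not necessarily the method used in the paper. AUC is computed as the fraction of positive/negative pairs that the model ranks correctly, with ties counted as one half.

```python
def precision(y_true, y_pred):
    """Precision = TP / (TP + FP); returns 0.0 if nothing is predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def soft_vote(score_lists):
    """Average the per-case scores of several models (soft voting)."""
    return [sum(col) / len(col) for col in zip(*score_lists)]


def evaluate(y_true, scores, threshold=0.5):
    """Report both criteria; the threshold is an illustrative assumption."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return {"precision": precision(y_true, y_pred), "auc": auc(y_true, scores)}


# Hypothetical toy data: 1 = SGA, 0 = not SGA; scores are made-up model outputs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [0.9, 0.4, 0.8, 0.6, 0.2, 0.1, 0.3, 0.2]
model_b = [0.7, 0.9, 0.3, 0.2, 0.6, 0.1, 0.2, 0.4]

results = {
    "model_a": evaluate(y_true, model_a),
    "model_b": evaluate(y_true, model_b),
    "ensemble": evaluate(y_true, soft_vote([model_a, model_b])),
}

for name, metrics in results.items():
    print(name, metrics)
```

In this toy case the two models make different mistakes, so averaging their scores ranks every positive case above every negative one. That shows why an ensemble can beat its members, but it says nothing about how large the gain is on real SGA data.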
