Genetic Algorithms and Classification Trees in Feature Discovery: Diabetes and the NHANES database

This paper presents a feature selection methodology for datasets that mix continuous and categorical variables. A Genetic Algorithm (GA) searches the dataset for a small set of features relevant to predicting a binary (1/0) response. The fitness of each candidate subset is measured with binary classification trees and an objective function based on conditional probabilities. We apply the method to health data to find factors useful for predicting diabetes. The results show that the algorithm narrows the set of predictors to around 8 factors, which can be validated against reputable medical and public health resources.
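The general idea can be illustrated with a minimal sketch; it is not the paper's implementation. Everything in it is an assumption made for illustration: the synthetic binary data from `make_data`, the hypothetical `fitness` function (held-out accuracy of leaf-wise conditional probability estimates, minus a small penalty per selected feature), and the GA settings in `run_ga`. A full partition of the rows on the selected binary features stands in for a fitted classification tree, and the paper's actual objective function is not reproduced here.

```python
import random

def make_data(n_rows=600, n_features=12, seed=0):
    """Synthetic binary data (assumption): only features 0 and 1 drive the response."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_rows):
        row = [rng.randint(0, 1) for _ in range(n_features)]
        p = 0.9 if (row[0] and row[1]) else 0.1
        X.append(row)
        y.append(1 if rng.random() < p else 0)
    return X, y

def fitness(mask, X, y, penalty=0.01):
    """Hypothetical fitness: held-out accuracy of leaf estimates of P(y=1 | leaf),
    minus a penalty per selected feature. Rows are partitioned on the selected
    binary features, a crude stand-in for a fitted classification tree."""
    cols = [j for j, bit in enumerate(mask) if bit]
    half = len(X) // 2
    leaves = {}  # leaf key -> (positives, total), estimated on the first half
    for row, label in zip(X[:half], y[:half]):
        key = tuple(row[j] for j in cols)
        pos, tot = leaves.get(key, (0, 0))
        leaves[key] = (pos + label, tot + 1)
    prior = sum(y[:half]) / half  # fallback for leaves unseen in the first half
    correct = 0
    for row, label in zip(X[half:], y[half:]):
        pos, tot = leaves.get(tuple(row[j] for j in cols), (0, 0))
        p1 = pos / tot if tot else prior
        correct += int((p1 >= 0.5) == bool(label))
    return correct / (len(X) - half) - penalty * len(cols)

def run_ga(X, y, pop_size=30, generations=40, mutation_rate=0.05, seed=1):
    """Simple GA over bitmasks: keep the top half, refill with crossover + mutation."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, X, y), reverse=True)
        elite = ranked[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit independently with probability mutation_rate
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda m: fitness(m, X, y))

X, y = make_data()
best = run_ga(X, y)
selected = [j for j, bit in enumerate(best) if bit]
print("selected features:", selected, "fitness:", round(fitness(best, X, y), 3))
```

Scoring on a held-out half and charging a cost per feature both push the search toward small subsets, which mirrors the stated goal of narrowing the predictors to a handful of factors. In practice the stand-in partition would be replaced by a proper tree learner that handles continuous and categorical inputs.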
