Tackling Missing Data in Community Health Studies Using Additive LS-SVM Classifier

Missing data is a common issue in community health and epidemiological studies. Directly removing samples with missing data reduces the sample size and can introduce information bias, which undermines the significance of the results. Data imputation methods are available for handling missing data, but their performance is limited and they can introduce noise into the dataset. Instead of data imputation, this paper proposes a novel method based on the additive least squares support vector machine (LS-SVM) for predictive modeling when the input features of the model contain missing data. Using a fast leave-one-out cross-validation strategy, the method simultaneously determines the influence of the features with missing values on the classification accuracy. The performance of the method is evaluated by applying it to predict the quality of life (QOL) of elderly people from health data collected in the community. The dataset covers demographics, socioeconomic status, health history, and the outcomes of health assessments of 444 community-dwelling elderly people, with 5% to 60% of data missing in some of the input features. The QOL is measured using a standard questionnaire of the World Health Organization. Results show that the proposed method outperforms four conventional methods for handling missing data (case deletion, feature deletion, mean imputation, and K-nearest neighbor imputation), with the average QOL prediction accuracy reaching 0.7418. The method is a potentially promising technique for tackling missing data in community health research and other applications.
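To make the idea concrete, the following is a minimal sketch of an additive LS-SVM with closed-form leave-one-out (LOO) outputs; it is not the authors' implementation. Several choices are assumptions made for illustration only: the per-feature RBF kernel, the rule that a feature's kernel term contributes zero whenever either value is missing (NaN), the function names `additive_rbf_kernel`, `fit_lssvm`, and `loo_accuracy`, the hyperparameters `gamma` and `sigma`, and the synthetic data. The LOO outputs use the standard LS-SVM identity that the LOO residual of sample i equals alpha_i divided by the i-th diagonal entry of the inverse system matrix, so no model is refitted.

```python
import numpy as np

def additive_rbf_kernel(A, B, sigma=1.0):
    """Sum of one-dimensional RBF kernels, one per feature.

    Assumption for this sketch: a feature's term contributes 0
    whenever either of the two values is missing (NaN).
    """
    K = np.zeros((A.shape[0], B.shape[0]))
    for d in range(A.shape[1]):
        diff = A[:, d][:, None] - B[:, d][None, :]  # NaN if either value is missing
        K += np.nan_to_num(np.exp(-diff ** 2 / (2 * sigma ** 2)), nan=0.0)
    return K

def fit_lssvm(K, y, gamma=1.0):
    """Solve the LS-SVM linear system [[K + I/gamma, 1], [1^T, 0]] [alpha; b] = [y; 0]."""
    n = len(y)
    C = np.zeros((n + 1, n + 1))
    C[:n, :n] = K + np.eye(n) / gamma
    C[:n, n] = 1.0
    C[n, :n] = 1.0
    C_inv = np.linalg.inv(C)
    sol = C_inv @ np.append(y, 0.0)
    return sol[:n], sol[n], C_inv

def loo_accuracy(X, y, gamma=1.0, sigma=1.0):
    """LOO accuracy from a single fit: residual_i = alpha_i / (C^{-1})_ii."""
    K = additive_rbf_kernel(X, X, sigma)
    alpha, _, C_inv = fit_lssvm(K, y, gamma)
    loo_out = y - alpha / np.diag(C_inv)[:len(y)]
    return float(np.mean(np.sign(loo_out) == y))

if __name__ == "__main__":
    # Synthetic data (not the study's dataset): labels in {-1, +1},
    # with roughly 30% of feature 1 set to missing.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 3))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
    X[rng.random(60) < 0.3, 1] = np.nan

    print("all features:", loo_accuracy(X, y))
    for d in range(X.shape[1]):
        # Dropping one feature at a time gives a rough view of its influence.
        print(f"without feature {d}:", loo_accuracy(np.delete(X, d, axis=1), y))
```

Comparing the LOO accuracy with and without a given feature is one simple way to gauge that feature's influence; the paper's own procedure for this may differ.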
