Analysis of incomplete and inconsistent clinical survey data

Clinical data from survey trials are commonly incomplete and inconsistent, for several reasons. Inconsistent data occur when a subject answers more than one set of mutually exclusive alternative questions. One objective of this study was to identify and eliminate inconsistent data as an important preprocessing step for data mining. We define three types of incomplete data: missing data due to skip pattern (SPMD), undetermined missing data (UMD), and genuine missing data (GMD). Identifying the type of missing data is another important objective, because the missing data types cannot all be treated the same way. This cannot be done manually for large data sets from complex surveys, since each subject must be processed individually. The analyses are carried out in a mathematical framework that exploits the graph-theoretic structure inherent in the questionnaire. An undirected graph is built from mutually inconsistent responses, along with its complement. Responses that are not in the largest maximal clique (the maximum clique) of the complement graph are considered inconsistent. This guarantees that as few responses as possible are removed while the remaining ones are mutually consistent. Further, all potential paths in the questionnaire's graph are considered, based on each subject's responses, to identify each type of incomplete data. Experiments were conducted on MESA data. Results show 15.4% GMD, 9.8% SPMD, 12.9% UMD, and 0.021% inconsistent data. The approach has further utility: (a) SPMD can be used for data stratification, and (b) inconsistent data can be used for noise estimation. The proposed method is a preprocessing prerequisite for any data mining of clinical survey data.
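
The clique step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`complement`, `maximum_clique`, `find_inconsistent`), the response labels, and the choice of Bron-Kerbosch with pivoting as the clique search are all assumptions made for illustration. Given one subject's answered responses and the pairs known to be mutually inconsistent, it builds the complement graph (edges join consistent pairs), finds a maximum clique, and flags every response outside it.

```python
from itertools import combinations


def complement(responses, conflicts):
    """Adjacency of the complement graph: an edge joins two responses
    that are NOT listed as mutually inconsistent."""
    conflict_set = {frozenset(pair) for pair in conflicts}
    adj = {r: set() for r in responses}
    for u, v in combinations(responses, 2):
        if frozenset((u, v)) not in conflict_set:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximum_clique(adj):
    """Largest maximal clique, found with Bron-Kerbosch plus pivoting
    (an assumed search strategy; exponential in the worst case)."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest so far.
            if len(r) > len(best):
                best = set(r)
            return
        # Pivot on the vertex with the most neighbours among candidates.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


def find_inconsistent(responses, conflicts):
    """Return (kept, flagged): the largest mutually consistent subset
    of responses and the responses outside it."""
    kept = maximum_clique(complement(responses, conflicts))
    return kept, set(responses) - kept


# Hypothetical subject: A1 and A2 are follow-ups on one branch, B1 is a
# follow-up on the mutually exclusive branch, yet all three were answered.
responses = ["A1", "A2", "B1"]
conflicts = [("A1", "B1"), ("A2", "B1")]
kept, flagged = find_inconsistent(responses, conflicts)
print(sorted(kept), sorted(flagged))
```

In this toy case the single response `B1` is flagged, since dropping it removes fewer answers than dropping both `A1` and `A2`. When several maximum cliques of equal size exist, the sketch returns whichever it finds first; the abstract does not say how such ties are resolved.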
