Diagnostic with incomplete nominal/discrete data

Missing values may be present in data without undermining its use for diagnostic or classification purposes, but they compromise the application of readily available software. Surrogate entries can remedy the situation, although the outcome is generally unknown. Discretization of continuous attributes renders all data nominal and is helpful in dealing with missing values; in particular, no special handling is required for different attribute types. A number of classifiers exist, or can be reformulated, for this representation. Some classifiers can be reinvented as data completion methods. In this work the Decision Tree, Nearest Neighbour, and Naive Bayesian methods are demonstrated to have the required aptness. An approach is implemented whereby the entered missing values are not necessarily a close match of the true data; however, they are intended to cause the least hindrance for classification. The proposed techniques find their application particularly in medical diagnostics. Where clinical data represents a number of related conditions, taking the Cartesian product of class values of the underlying sub-problems allows narrowing down the selection of missing value substitutes. Real-world data examples, some publicly available, are used for testing. The proposed and benchmark methods are compared by classifying the data before and after missing value imputation, indicating a significant improvement.
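To make the idea of reusing a classifier as a data completion method concrete, the following is a minimal sketch of nearest-neighbour imputation on nominal data. It is not the method of this work: the distance measure (mismatch rate over attributes known in both records), the restriction of donors to records of the same class, the `MISSING` marker, the function names, and the toy records are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing nominal value


def mismatch_rate(a, b):
    """Fraction of differing values over attributes known in both records.

    Returns 1.0 (maximally distant) when the records share no known attribute.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def impute_nearest(records, labels, k=3):
    """Fill each missing entry with the most common value among the k nearest
    records of the same class that have that attribute recorded."""
    completed = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not MISSING:
                continue
            # Candidate donors: same class, attribute j known, not the record itself.
            donors = sorted(
                (mismatch_rate(rec, other), idx)
                for idx, other in enumerate(records)
                if idx != i and labels[idx] == labels[i] and other[j] is not MISSING
            )
            nearest_values = [records[idx][j] for _, idx in donors[:k]]
            if nearest_values:  # leave the gap if no donor exists
                completed[i][j] = Counter(nearest_values).most_common(1)[0][0]
    return completed


# Toy, made-up records: (pressure, symptom, age group) with a class label each.
records = [
    ["high", "yes", "old"],
    ["high", MISSING, "old"],
    ["low", "no", "young"],
    ["low", "no", MISSING],
    ["high", "yes", "young"],
]
labels = ["sick", "sick", "healthy", "healthy", "sick"]

completed = impute_nearest(records, labels)
for row in completed:
    print(row)
```

Because donors are compared by class label only for equality, a label here could also be a tuple of sub-problem class values, which is one way the Cartesian product of classes might narrow the set of candidate substitutes.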
