Estimation of Missing Data Using Computational Intelligence and Decision Trees

This paper introduces a novel paradigm for imputing missing data that combines a decision tree with an auto-associative neural network (AANN) based model and a principal component analysis-neural network (PCA-NN) based model. For each model, the decision tree predicts search bounds for a genetic algorithm that minimizes an error function derived from the respective model. The models' ability to impute missing data is tested and compared using HIV seroprevalence data. Results indicate an average increase in accuracy of 13 percentage points: the AANN based model's average accuracy increases from 75.8% to 86.3%, while that of the PCA-NN based model increases from 66.1% to 81.6%.
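
The pipeline described above (a tree proposes bounds, then a genetic algorithm searches within them for the value that minimizes a model-derived error) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's implementation: the fixed linear projection `reconstruct` stands in for a trained AANN, the hand-written rule `predict_bounds` stands in for a trained decision tree, and the toy four-feature record, population size, and mutation scale are arbitrary choices.

```python
import random

random.seed(0)

# Hypothetical stand-in for a trained auto-associative network: projection of
# a 4-feature record onto a fixed 2-D subspace with orthonormal basis U, V.
S = 3 ** -0.5
U = [S, 0.0, S, S]
V = [0.0, S, S, -S]

def reconstruct(x):
    a = sum(ui * xi for ui, xi in zip(U, x))
    b = sum(vi * xi for vi, xi in zip(V, x))
    return [a * ui + b * vi for ui, vi in zip(U, V)]

def error(candidate, record, idx):
    """Squared reconstruction error with `candidate` filled in at `idx`."""
    x = list(record)
    x[idx] = candidate
    return sum((xi - ri) ** 2 for xi, ri in zip(x, reconstruct(x)))

def predict_bounds(record):
    """Hand-written rule standing in for a trained decision tree: it maps the
    known features to a coarse interval for the missing feature 0."""
    return (0.0, 3.0) if record[2] + record[3] > 0 else (-3.0, 0.0)

def genetic_search(record, idx, low, high, pop_size=40, generations=60):
    """Minimal real-coded genetic algorithm confined to [low, high]."""
    population = [random.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: error(c, record, idx))
        parents = population[: pop_size // 2]  # truncation selection, elitist
        children = []
        for p in parents:
            mate = random.choice(parents)
            w = random.random()
            child = w * p + (1 - w) * mate  # arithmetic crossover
            child += random.gauss(0.0, 0.05 * (high - low))  # Gaussian mutation
            children.append(min(high, max(low, child)))  # clip to the bounds
        population = parents + children
    return min(population, key=lambda c: error(c, record, idx))

# Toy record whose complete form would be (1.2, -0.4, 0.8, 1.6); feature 0 is
# treated as missing, so its stored value is only a placeholder.
true_value = 1.2
record = [float("nan"), -0.4, 0.8, 1.6]
low, high = predict_bounds(record)
estimate = genetic_search(record, 0, low, high)
print(f"bounds=({low}, {high}), estimate={estimate:.3f}, true={true_value}")
```

In this sketch the bounds only restrict where the genetic algorithm samples and mutates. The error function itself is unchanged, which mirrors the abstract's description of the tree as a source of search bounds and not as a predictor of the missing value itself.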
