Missing Data Imputation Through the Use of the Random Forest Algorithm

This paper compares several paradigms for missing data imputation. The data set is HIV seroprevalence data from an antenatal clinic study survey performed in 2001. Imputation is performed with five methods: random forests; auto-associative neural networks with genetic algorithms; auto-associative neuro-fuzzy configurations; and two hybrids based on random forests and neural networks. The results indicate that, for this data set, random forests are superior in both accuracy and computation time, with average accuracy increases of up to 32% for certain variables when compared with auto-associative networks. While the concept of hybrid systems shows promise, the hybrids presented here appear to be hindered by their auto-associative neural network components.
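To make the idea of random forest imputation concrete, the following is a minimal sketch and not the implementation used in the paper, whose details are not given here. It assumes numeric data in which only one column contains missing values (marked `None`) and every other column is complete. The function names (`build_tree`, `predict_tree`, `fit_forest`, `impute`) and the default parameters are hypothetical choices made for illustration. A forest of bootstrapped regression trees is trained on the complete rows, and each missing entry is replaced by the average of the trees' predictions.

```python
import random
from statistics import mean


def build_tree(rows, targets, depth, rng, n_try):
    """Grow a regression tree; each split examines n_try randomly chosen features."""
    if depth == 0 or len(rows) < 4 or len(set(targets)) == 1:
        return mean(targets)  # leaf: average target value
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        for thr in set(r[f] for r in rows):
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            # Sum of squared errors after the candidate split
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    if best is None:
        return mean(targets)
    _, f, thr = best
    li = [i for i, r in enumerate(rows) if r[f] <= thr]
    ri = [i for i, r in enumerate(rows) if r[f] > thr]
    return (
        f,
        thr,
        build_tree([rows[i] for i in li], [targets[i] for i in li], depth - 1, rng, n_try),
        build_tree([rows[i] for i in ri], [targets[i] for i in ri], depth - 1, rng, n_try),
    )


def predict_tree(node, row):
    """Follow the splits down to a leaf and return its stored mean."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def fit_forest(rows, targets, n_trees=25, depth=4, seed=0):
    """Train n_trees trees, each on a bootstrap resample of the rows."""
    rng = random.Random(seed)
    n_try = max(1, len(rows[0]) // 3)  # assumed rule of thumb for regression
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(
            build_tree([rows[i] for i in idx], [targets[i] for i in idx], depth, rng, n_try)
        )
    return forest


def impute(data, col, **forest_args):
    """Return a copy of data with None entries in column col filled by the forest."""

    def feats(r):
        return [v for j, v in enumerate(r) if j != col]

    known = [r for r in data if r[col] is not None]
    forest = fit_forest([feats(r) for r in known], [r[col] for r in known], **forest_args)
    filled = []
    for r in data:
        r = list(r)
        if r[col] is None:
            r[col] = mean(predict_tree(t, feats(r)) for t in forest)
        filled.append(r)
    return filled
```

Calling `impute(data, col=2)` on a list of rows returns a new list in which the `None` entries of the third column are replaced by forest predictions; rows that were already complete are copied unchanged. A practical study would more likely rely on an established random forest library and would need to handle missing values in several columns at once.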
