Techniques for Dealing with Missing Data in Knowledge Discovery Tasks

Information plays a central role in modern life. Advances in many research fields depend on the ability to discover knowledge in very large databases, and many businesses base their success on the availability of marketing information. Such data sets are usually large and not always easy to manage. Scientists from different research areas have developed methods to analyze huge amounts of data and to extract useful information. These methods may extract different kinds of knowledge, depending on the data and on user requirements. In particular, one important knowledge discovery task is supervised learning. Today there are many methods for building classifiers, drawn from different fields such as artificial intelligence, soft computing, and statistics.

Unfortunately, traditional methods usually cannot deal directly with real-world data, because of missing or wrong items. This report concerns the former problem: the unavailability of some values. The majority of interesting databases are incomplete, i.e., one or more values are missing inside some records, or entire records are missing. Many techniques exist to manage data with missing items, but none is uniformly better than the others: different situations require different solutions. As Allison says, “the only really good solution to the missing data problem is not to have any” [1].

This report reviews the main missing data techniques (MDTs) and highlights their advantages and disadvantages. The next section introduces some terminology and presents a taxonomy of MDTs. Section 3 describes these methods in more detail. Finally, some conclusions are reported.

[1] John L. P. Thompson, et al. Missing data, 2004, Amyotrophic lateral sclerosis and other motor neuron disorders: official publication of the World Federation of Neurology, Research Group on Motor Neuron Diseases.

[2] David E. Booth, et al. Analysis of Incomplete Multivariate Data, 2000, Technometrics.

[3] Gustavo E. A. P. A. Batista, et al. An analysis of four missing data treatment methods for supervised learning, 2003, Appl. Artif. Intell.

[4] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[5] Max Bramer, et al. Techniques for Dealing with Missing Values in Classification, 1997, IDA.

[6] Ivo Düntsch, et al. Maximum Consistency of Incomplete Data via Non-Invasive Imputation, 2004, Artificial Intelligence Review.

[7] Tariq Samad, et al. Imputation of Missing Data in Industrial Databases, 1999, Applied Intelligence.

[8] Jerzy W. Grzymala-Busse, et al. A comparison of three closest fit approaches to missing attribute values in preterm birth data, 2002, Int. J. Intell. Syst.

[9] Donald B. Rubin, et al. An Overview of Multiple Imputation, 2002.

[10] Zijian Zheng, et al. Classifying Unseen Cases with Many Missing Values, 1999, PAKDD.

[11] J. Ross Quinlan, et al. Unknown Attribute Values in Induction, 1989, ML.

[12] Alberto Maria Segre, et al. Programs for Machine Learning, 1994.

[13] Mingxiu Hu, et al. Evaluation of Some Popular Imputation Algorithms, 2002.

[14] Richard C. T. Lee, et al. Towards Automatic Auditing of Records, 1978, IEEE Transactions on Software Engineering.