Abstract Many analysis tasks have to deal with missing values and have developed specific, internal treatments to guess them. In this paper we present an external method, MVC (Missing Values Completion), which improves completion performance as well as declarativity and interaction with the user. These qualities make it suitable for the data cleaning step of the Knowledge Discovery in Databases (KDD) process (U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From data mining to knowledge discovery: an overview, in: Advances in Knowledge Discovery and Data Mining, MIT Press, Cambridge, MA, USA, 1996, pp. 1–36). The core of MVC is the Robust Association Rules (RAR) algorithm that we proposed earlier (A. Ragel, B. Crémilleux, Treatment of missing values for association rules, in: Proceedings of the Second Pacific–Asia Conference on Knowledge Discovery and Data Mining (PAKDD-98), Melbourne, Australia, Lecture Notes in Artificial Intelligence 1394, Springer, Berlin, 1998, pp. 258–270). This algorithm extends the concept of association rules (R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, USA, 1993, pp. 207–216) to databases with multiple missing values. It makes MVC an efficient preprocessing method: in our experiments with the C4.5 (J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, USA, 1993) decision tree program, MVC reduced the classification error rate by up to a factor of two, in addition to a significant gain in declarativity.
[1] J. Ross Quinlan, et al. Unknown Attribute Values in Induction. ML, 1989.
[2] Padhraic Smyth, et al. From Data Mining to Knowledge Discovery: An Overview. Advances in Knowledge Discovery and Data Mining, 1996.
[3] Gilles Celeux. Le traitement des donnees manquantes dans le logiciel SICLA. 1988.
[4] Leo Breiman, et al. Classification and Regression Trees. 1984.
[5] Robert P. Goldman, et al. Imputation of Missing Data Using Machine Learning Techniques. KDD, 1996.
[6] Tomasz Imielinski, et al. Mining association rules between sets of items in large databases. SIGMOD Conference, 1993.
[7] Hannu Toivonen, et al. Sampling Large Databases for Association Rules. VLDB, 1996.
[8] D. Rubin, et al. Statistical Analysis with Missing Data. 1988.
[9] Ronald L. Rivest, et al. Constructing Optimal Binary Decision Trees is NP-Complete. Inf. Process. Lett., 1976.
[10] James Kelly, et al. AutoClass: A Bayesian Classification System. ML, 1993.
[11] Max Bramer, et al. Techniques for Dealing with Missing Values in Classification. IDA, 1997.
[12] Heikki Mannila, et al. Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining, 1996.
[13] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning. 1992.
[14] Aiko M. Hormann, et al. Programs for Machine Learning. Part I. Inf. Control, 1962.
[15] Bruno Crémilleux, et al. Treatment of Missing Values for Association Rules. PAKDD, 1998.