A Comparison of the Quality of Rule Induction from Inconsistent Data Sets and Incomplete Data Sets

In data mining, decision rules induced from known examples are used to classify unseen cases. There are various rule induction algorithms, such as LEM1 (Learning from Examples Module version 1), LEM2 (Learning from Examples Module version 2) and MLEM2 (Modified Learning from Examples Module version 2). In the real world, many data sets are imperfect: they may be inconsistent or incomplete. The idea of lower and upper approximations, or more generally of probabilistic approximations, provides an effective way to induce rules from both inconsistent and incomplete data sets. However, the accuracy of rule sets induced from imperfect data sets is expected to be lower. The objective of this project is to investigate which kind of imperfect data set (inconsistent or incomplete) is worse in terms of the quality of rule induction. Experiments were conducted on eight inconsistent data sets and eight incomplete data sets with lost values. We implemented the MLEM2 algorithm to induce certain and possible rules from the inconsistent data sets, and the local probabilistic version of the MLEM2 algorithm to induce certain and possible rules from the incomplete data sets. A program called Rule Checker was also developed to classify unseen cases with the induced rules and to measure the classification error rate. Ten-fold cross validation was carried out, and the average error rate was used as the criterion for comparison. Mann-Whitney nonparametric tests were performed, separately for certain and possible rules, to compare incompleteness with inconsistency. The results show no significant difference between inconsistent and incomplete data sets in terms of the quality of rule induction.
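For reference, the approximations mentioned above can be summarized with the standard rough-set definitions; this is a sketch in the usual notation, not quoted from the paper. For a concept X over the universe of cases U and the characteristic set K(x) of a case x:

    \underline{appr}(X)  = \bigcup \{ K(x) \mid x \in U,\; K(x) \subseteq X \}                 (lower approximation)
    \overline{appr}(X)   = \bigcup \{ K(x) \mid x \in U,\; K(x) \cap X \neq \emptyset \}       (upper approximation)
    appr_{\alpha}(X)     = \bigcup \{ K(x) \mid x \in U,\; \Pr(X \mid K(x)) \geq \alpha \}     (probabilistic approximation, 0 < \alpha \leq 1)

Certain rules are induced from the lower approximation and possible rules from the upper approximation; the probabilistic approximation reduces to the lower approximation for \alpha = 1 and to the upper approximation for sufficiently small positive \alpha.

The statistical comparison described above could be reproduced, for example, with the Mann-Whitney U test as implemented in SciPy. The sketch below is illustrative only: the error-rate values are placeholders, not the experimental results, and the test would be run once for certain rules and once for possible rules.

    from scipy.stats import mannwhitneyu

    # Placeholder average ten-fold cross-validation error rates, one value per
    # data set (eight inconsistent and eight incomplete data sets); these are
    # hypothetical numbers, not the results reported in the paper.
    errors_inconsistent = [0.21, 0.34, 0.18, 0.29, 0.41, 0.25, 0.33, 0.27]
    errors_incomplete = [0.24, 0.31, 0.22, 0.35, 0.38, 0.28, 0.30, 0.26]

    # Two-sided test of whether one group of error rates is stochastically
    # larger than the other.
    statistic, p_value = mannwhitneyu(errors_inconsistent, errors_incomplete,
                                      alternative="two-sided")
    print(f"U = {statistic}, p = {p_value:.3f}")
    # A p-value above the chosen significance level (e.g., 0.05) indicates no
    # significant difference in rule quality between the two groups.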
