Addressing Missing Attributes during Data Mining Using Frequent Itemsets and Rough Set Based Predictions

In this paper, we present an improved method for predicting missing attribute values in data sets. We use frequent itemsets, generated by an association rule mining algorithm, which capture the correlations between items in a set of transactions. In particular, we treat a database as a set of transactions and each data instance as an itemset, so that the frequent itemsets can serve as a knowledge base for predicting missing attribute values. Our approach integrates the RSFit method, which is based on rough set theory and produces faster predictions by considering similarities of attribute-value pairs only for those attributes contained in the core or a reduct of the data set. Using empirical studies on UCI and other real-world data sets, we demonstrate a significant increase in prediction accuracy obtained from our new integrated approach, referred to as ItemRSFit.
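The idea of using frequent itemsets as a knowledge base can be illustrated with a minimal sketch. It is not the ItemRSFit implementation: the function names `frequent_itemsets` and `predict_missing`, the toy data, and the tie-breaking rule (prefer the itemset matching the most known attribute values, then the highest support) are assumptions made for illustration, itemsets are enumerated by brute force rather than by an Apriori-style algorithm, and the RSFit fallback is omitted.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(rows, min_support, max_size=3):
    """Count (attribute, value) itemsets occurring in at least min_support rows.

    Brute-force enumeration; missing values (None) are ignored.
    """
    counts = Counter()
    for row in rows:
        items = [(a, v) for a, v in row.items() if v is not None]
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

def predict_missing(row, attr, itemsets):
    """Predict row[attr] from frequent itemsets consistent with the known values.

    Illustrative rule (an assumption): prefer the itemset that matches the
    most known attribute-value pairs, then the one with the highest support.
    Returns None when no frequent itemset applies.
    """
    known = {(a, v) for a, v in row.items() if v is not None and a != attr}
    best = None  # (number of matched items, support, predicted value)
    for itemset, support in itemsets.items():
        target = [v for a, v in itemset if a == attr]
        if not target:
            continue  # itemset says nothing about the missing attribute
        rest = {item for item in itemset if item[0] != attr}
        if not rest or not rest <= known:
            continue  # no supporting evidence, or evidence contradicts the row
        candidate = (len(rest), support, target[0])
        if best is None or candidate[:2] > best[:2]:
            best = candidate
    return None if best is None else best[2]

# Hypothetical toy data: each instance is treated as one transaction.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
itemsets = frequent_itemsets(rows, min_support=2)

incomplete = {"outlook": "sunny", "windy": "no", "play": None}
prediction = predict_missing(incomplete, "play", itemsets)
```

When no frequent itemset is consistent with the known values, `predict_missing` returns `None`; in an integrated scheme that is the point at which a second predictor, such as a rough set based one, would be consulted.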
