A Scalable Rough Set Knowledge Reduction Algorithm

Knowledge reduction algorithms based on rough set theory play an important role in KDD because of their advantage in dealing with uncertain data. However, classical rough set knowledge reduction algorithms have difficulty handling huge data sets. This paper presents a structure called the Class Distribution List (CDL), which expresses the distribution of all attribute values over the whole sample space. Using database technology, a CDL can be generated by classifying the original data set. A group of rough-set-based knowledge reduction algorithms is then revised to operate on the CDL, so that huge data sets can be processed directly. As a framework, the CDL method can also be applied to other rough set algorithms to improve their scalability without decreasing their accuracy. Simulation experiments demonstrate the efficiency of the revised algorithms.
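The abstract does not give the exact layout of a CDL, so the following is only a minimal sketch under stated assumptions: a CDL is taken to be a table of counts, one entry per distinct combination of condition-attribute values and decision value, produced by a SQL `GROUP BY`. The names `build_cdl` and `dependency`, the `samples` table, and the toy data are all hypothetical. The sketch then computes the standard rough set dependency degree (size of the positive region divided by the number of samples) from the counts alone, without revisiting the raw rows.

```python
import sqlite3
from collections import defaultdict

def build_cdl(rows, attrs, decision):
    """Aggregate raw rows into (condition values, decision value, count) entries.

    Assumption: this count table stands in for the paper's CDL.
    """
    conn = sqlite3.connect(":memory:")
    cols = list(attrs) + [decision]
    col_sql = ", ".join(f'"{c}"' for c in cols)
    conn.execute(f"CREATE TABLE samples ({col_sql})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(f"INSERT INTO samples VALUES ({placeholders})", rows)
    # One GROUP BY pass lets the database do the classification of the data.
    cur = conn.execute(
        f"SELECT {col_sql}, COUNT(*) FROM samples GROUP BY {col_sql}"
    )
    n = len(attrs)
    cdl = [(tuple(r[:n]), r[n], r[-1]) for r in cur]
    conn.close()
    return cdl

def dependency(cdl, attrs, subset):
    """Dependency degree of the decision on `subset`, using counts only."""
    idx = [list(attrs).index(a) for a in subset]
    classes = defaultdict(lambda: defaultdict(int))
    for cond, dec, count in cdl:
        key = tuple(cond[i] for i in idx)
        classes[key][dec] += count
    total = sum(count for _, _, count in cdl)
    # An equivalence class lies in the positive region if it has one decision value.
    positive = sum(sum(d.values()) for d in classes.values() if len(d) == 1)
    return positive / total

# Hypothetical toy decision table: (outlook, windy, play)
ATTRS = ["outlook", "windy"]
ROWS = [
    ("sunny", "no", "yes"),
    ("sunny", "no", "yes"),
    ("sunny", "yes", "no"),
    ("rainy", "yes", "no"),
    ("rainy", "no", "no"),
    ("rainy", "no", "yes"),
]
CDL = build_cdl(ROWS, ATTRS, "play")
```

Because the count table has at most one entry per distinct value combination, its size depends on the number of distinct combinations rather than on the number of raw rows, which is one plausible reading of why such a structure would help with scalability.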
