Parallel Implementation of Chi2 Algorithm in MapReduce Framework

Discretization of continuous attributes is an important preprocessing step in machine learning and data mining. Performing this discretization efficiently on massive data sets has become a pressing problem. Hadoop, which has gained prominence in recent years, can efficiently run many applications on massive data. This paper designs and implements a parallel Chi2-based discretization algorithm on the MapReduce model. Focusing on discretization efficiency, experiments were run on data sets of different sizes using different numbers of nodes. The experimental results show that the proposed algorithm is efficient and scales well when discretizing continuous attributes of massive data.
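As an illustration only, the sketch below shows one way the counting step behind a chi-square-based discretizer could be split into map and reduce phases. It is not the algorithm proposed in the paper. The function names (`map_counts`, `reduce_counts`, `chi_square`), the toy partitions, and the use of plain Python in place of Hadoop are all assumptions made for this example. Each partition stands in for an input split. The statistic is the standard chi-square over the 2 x k contingency table of two adjacent intervals, as used in ChiMerge-style merging.

```python
from collections import Counter
from functools import reduce

def map_counts(partition):
    """Map step (hypothetical): count (value, class_label) pairs in one partition."""
    return Counter((value, label) for value, label in partition)

def reduce_counts(a, b):
    """Reduce step (hypothetical): merge the partial counts of two partitions."""
    return a + b

def chi_square(counts, left, right, labels):
    """Chi-square statistic for two adjacent intervals, here single values."""
    rows = [[counts.get((v, c), 0) for c in labels] for v in (left, right)]
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][j] + rows[1][j] for j in range(len(labels))]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(labels)):
            expected = row_totals[i] * col_totals[j] / total
            # Simplification for this sketch: cells with zero expected count are skipped.
            if expected > 0:
                chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2

# Toy data, assumed for illustration: each inner list plays the role of one input split.
partitions = [
    [(1.0, "a"), (1.0, "a"), (2.0, "b")],
    [(2.0, "b"), (1.0, "a"), (2.0, "b")],
]
counts = reduce(reduce_counts, map(map_counts, partitions), Counter())
labels = sorted({label for _, label in counts})
chi2_value = chi_square(counts, 1.0, 2.0, labels)
print(chi2_value)
```

Only the per-partition counting is parallel in this sketch. The decision to merge adjacent intervals with a low chi-square value would still have to be coordinated across iterations, which is the part a full MapReduce design must address.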
