Parallel Decision Tree with Application to Water Quality Data Analysis

The decision tree is a popular classification technique in many applications, such as retail target marketing, fraud detection, and the design of telecommunication service plans. With the information explosion, existing classification algorithms are no longer adequate for large data sets. To address this problem, many researchers have tried to design efficient parallel classification algorithms. Building on MapReduce, a widely used parallel programming framework, we propose a parallel ID3 classification algorithm (PID3 for short). As experimental data, we use water quality monitoring data from the Changjiang River, which contains 17 branches. Because the data are time series, we convert them to attribute data before applying the decision tree. The experimental results demonstrate that the proposed algorithm scales well and processes large datasets efficiently on commodity hardware.
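The abstract does not spell out how PID3 divides work between map and reduce, so the following is only a minimal sketch of one common way to express the ID3 split selection in MapReduce style, simulated locally in plain Python. The map step emits a count for each (attribute, value, class) triple, the reduce step sums those counts, and information gain is then computed from the aggregated counts alone. All function names, attribute names, and the toy records are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def map_record(record, label):
    """Map step: emit ((attribute, value, label), 1) for each attribute of one record."""
    for attr, value in record.items():
        yield (attr, value, label), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts that share the same key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def entropy(class_counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    class_counts = list(class_counts)
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)

def best_attribute(counts):
    """Pick the attribute with the highest information gain from aggregated counts."""
    # Regroup the flat counts as attribute -> value -> class label -> count.
    by_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), c in counts.items():
        by_attr[attr][value][label] += c
    gains = {}
    for attr, values in by_attr.items():
        label_totals = Counter()
        for label_counts in values.values():
            label_totals.update(label_counts)
        n = sum(label_totals.values())
        base = entropy(label_totals.values())
        # Weighted entropy of the partitions induced by this attribute.
        remainder = sum(
            sum(lc.values()) / n * entropy(lc.values()) for lc in values.values()
        )
        gains[attr] = base - remainder
    return max(gains, key=gains.get), gains

# Hypothetical, already discretized water quality records (not real data).
rows = [
    ({"ph": "normal", "oxygen": "high"}, "good"),
    ({"ph": "high", "oxygen": "high"}, "good"),
    ({"ph": "normal", "oxygen": "low"}, "poor"),
    ({"ph": "high", "oxygen": "low"}, "poor"),
]

# On a cluster the map calls would run in parallel over data splits;
# here they are chained sequentially to keep the sketch self-contained.
pairs = (kv for record, label in rows for kv in map_record(record, label))
counts = reduce_counts(pairs)
best, gains = best_attribute(counts)
print(best, gains)
```

In this toy data the hypothetical `oxygen` attribute separates the classes perfectly, so it is selected as the split. A full tree builder would repeat this counting job for each node on the records that reach it.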
