Analysis and Improvement of SPRINT Algorithm Based on Hadoop

With the rapid development of computers and networks, the growth in data volume makes data mining increasingly difficult. To address this problem, this paper proposes an improved SPRINT algorithm based on the Hadoop platform. After analyzing the traditional SPRINT algorithm, we improve it in three respects: eliminating unnecessary and repeated calculations when processing discrete attributes; omitting the presorting of continuous attributes and splitting directly by line; and adding a node field to the attribute lists in the data structure. To demonstrate the effectiveness of the improved algorithm, we test its speedup and accuracy. The experimental results show that, compared with the original SPRINT algorithm, the improved algorithm preserves accuracy while reducing the time needed to compute the best split point, and thus accelerates decision-tree construction.
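
In SPRINT, the best split point is the candidate split with the lowest Gini index. The following is a minimal sketch of that criterion for one continuous attribute. It uses the classic presorted scan, not the improved procedure proposed in this paper, whose details are not given in the abstract. The function names and the toy data are hypothetical and serve only as illustration.

```python
from collections import Counter


def gini(counts):
    """Gini index of a class histogram: 1 - sum_j p_j^2."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left, right):
    """Weighted Gini index of a binary split given two class histograms."""
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


def best_numeric_split(records):
    """Scan (value, label) pairs and return (threshold, gini) of the best split.

    Assumption: this follows the classic approach of sorting the attribute
    list once and updating two class histograms while scanning it.
    """
    ordered = sorted(records, key=lambda r: r[0])
    left = Counter()                              # classes below the threshold
    right = Counter(label for _, label in ordered)  # classes above it
    best_threshold, best_score = None, float("inf")
    for i in range(len(ordered) - 1):
        value, label = ordered[i]
        left[label] += 1
        right[label] -= 1
        next_value = ordered[i + 1][0]
        if value == next_value:
            continue  # no valid threshold between equal values
        score = gini_split(left, right)
        if score < best_score:
            best_threshold = (value + next_value) / 2
            best_score = score
    return best_threshold, best_score


# Toy attribute list: (age, risk class).
records = [(23, "high"), (17, "high"), (43, "high"),
           (68, "low"), (32, "low"), (20, "high")]
threshold, score = best_numeric_split(records)
print(threshold, round(score, 4))
```

Because the two histograms are updated incrementally, each candidate threshold is evaluated in time proportional to the number of classes rather than the number of records, which is why the split-point search dominates only through the sort and the single scan.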
