Paper Information - A New Data Classification Algorithm for Data-Intensive Computing Environments

A New Data Classification Algorithm for Data-Intensive Computing Environments

Data mining techniques for data-intensive computing face two problems: improving the scalability of data processing and improving data availability. To address them, this paper presents a new tree learning method. By introducing MapReduce, the SPRINT-based tree learning method achieves good scalability when handling large datasets. We define the split-point selection process as a series of distributed computations, each implemented with the MapReduce model. A new data structure, called the class distribution table, is introduced to assist the calculation of histograms. Experiments and analysis of the results show that the algorithm has strong data mining processing capability in data-intensive computing environments.
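The abstract does not spell out how the split-point computation is decomposed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single numeric attribute, the Gini index as the split criterion (the criterion used by SPRINT), and an in-memory simulation of the map and reduce phases rather than a real MapReduce cluster. The names `map_phase`, `reduce_phase`, `best_split`, and the sample records are hypothetical; the table built by `reduce_phase` is one plausible reading of a "class distribution table".

```python
from collections import Counter, defaultdict
from itertools import chain


def map_phase(partition, attr):
    """Mapper (hypothetical): emit ((attribute value, class label), 1) per record."""
    for record in partition:
        yield (record[attr], record["label"]), 1


def reduce_phase(pairs):
    """Reducer (hypothetical): sum counts into value -> Counter(class -> count)."""
    table = defaultdict(Counter)
    for (value, label), count in pairs:
        table[value][label] += count
    return table


def gini(hist):
    """Gini index of a class histogram; 0.0 for an empty histogram."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())


def best_split(table):
    """Scan distinct values in sorted order, keeping below/above class histograms.

    Returns (weighted Gini, split point), where the split point is the
    midpoint between two adjacent distinct attribute values.
    """
    above = Counter()
    for hist in table.values():
        above.update(hist)
    below = Counter()
    total = sum(above.values())
    values = sorted(table)
    best = (float("inf"), None)
    for lo, hi in zip(values, values[1:]):
        # Move all records with value `lo` from the "above" side to "below".
        below.update(table[lo])
        above.subtract(table[lo])
        n_below = sum(below.values())
        n_above = total - n_below
        score = (n_below * gini(below) + n_above * gini(above)) / total
        if score < best[0]:
            best = (score, (lo + hi) / 2)
    return best


# Two data partitions, as if stored on two different nodes (made-up records).
part1 = [{"age": 23, "label": "high"}, {"age": 17, "label": "high"},
         {"age": 43, "label": "low"}]
part2 = [{"age": 68, "label": "low"}, {"age": 32, "label": "low"},
         {"age": 20, "label": "high"}]

pairs = chain(map_phase(part1, "age"), map_phase(part2, "age"))
table = reduce_phase(pairs)
split = best_split(table)
print(split)
```

In this sketch only the per-value class counts cross the map/reduce boundary, so the split search runs over the aggregated table instead of the raw records; whether the paper's class distribution table is organized this way is an assumption.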

Qi Zhi Deng | Xin Qian | Feng Ying Wang | Ya Li Chen | Long Bo Zhang

[1] Jorma Rissanen, et al. SLIQ: A Fast Scalable Classifier for Data Mining, 1996, EDBT.

[2] Ji Genlin, et al. Research and Implementation of ID3 Based on Distributed Database System, 2005.

[3] Wang Peng, et al. Review of Programming Models for Data-Intensive Computing, 2010.

[4] Geoffrey C. Fox, et al. MapReduce for Data Intensive Scientific Analyses, 2008, 2008 IEEE Fourth International Conference on eScience.

[5] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[6] Pete Wyckoff, et al. Hive - A Warehousing Solution Over a Map-Reduce Framework, 2009, Proc. VLDB Endow.

[7] Rakesh Agrawal, et al. SPRINT: A Scalable Parallel Classifier for Data Mining, 1996, VLDB.

[8] Roberto J. Bayardo, et al. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, 2009, Proc. VLDB Endow.

[9] Sanjay Ghemawat, et al. MapReduce: a flexible data processing tool, 2010, CACM.

[10] Vasant Honavar, et al. Decision Tree Induction from Distributed Heterogeneous Autonomous Data Sources, 2003.

[11] Ian Gorton, et al. The Changing Paradigm of Data-Intensive Computing, 2009, Computer.