Paper Information - A New Data Classification Algorithm for Data-Intensive Computing Environments


Data-intensive computing has received substantial attention since the arrival of the big data era, yet research on data mining in data-intensive computing environments is still at an early stage. This paper proposes MR-DIDC, a decision tree classification algorithm based on the MapReduce programming framework and the SPRINT algorithm. MR-DIDC inherits the advantages of MapReduce, which makes it better suited to data-intensive computing applications. The algorithm's performance is evaluated on an example, and the experimental results show that MR-DIDC can shorten running time and improve accuracy in a big data environment.
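The abstract does not give the details of MR-DIDC, so the following is only a rough sketch of the general idea it names: a SPRINT-style split search (Gini index over midpoints of a numeric attribute) whose class counts are gathered in a map step and a reduce step. Everything here is an assumption made for illustration, including the function names (`map_phase`, `reduce_phase`, `best_split`), the record layout with a `"label"` key, and the in-process simulation of MapReduce; it is not the authors' implementation.

```python
from collections import Counter, defaultdict


def gini(counts):
    """Gini impurity of a mapping from class label to count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def map_phase(partition, attr):
    """Map step (hypothetical): emit ((attribute value, label), 1) per record."""
    for record in partition:
        yield (record[attr], record["label"]), 1


def reduce_phase(pairs):
    """Reduce step (hypothetical): sum the emitted counts per key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def best_split(partitions, attr):
    """Return (threshold, weighted Gini) of the best binary split on attr."""
    pairs = (kv for part in partitions for kv in map_phase(part, attr))
    totals = reduce_phase(pairs)

    # Regroup the reduced counts into one class histogram per attribute value.
    by_value = defaultdict(Counter)
    for (value, label), n in totals.items():
        by_value[value][label] += n

    values = sorted(by_value)
    left, right = Counter(), Counter()
    for v in values:
        right.update(by_value[v])
    n_total = sum(right.values())

    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        left.update(by_value[lo])
        right.subtract(by_value[lo])
        n_left = sum(left.values())
        score = (n_left * gini(left) + (n_total - n_left) * gini(right)) / n_total
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best
```

In this sketch each inner list passed to `best_split` stands in for one data partition handled by a separate mapper; a real deployment would run the two phases on a MapReduce cluster rather than in a single process.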

Qi Zhi Deng | Long Bo Zhang | Xin Qian | Ya Li Chen | Feng Ying Wang
