A New Data Classification Algorithm for Data-Intensive Computing Environments

Data-intensive computing has received substantial attention since the arrival of the big data era, but research on data mining in data-intensive computing environments is still at an early stage. This paper proposes MR-DIDC, a decision tree classification algorithm built on the MapReduce programming framework and the SPRINT algorithm. MR-DIDC inherits the advantages of MapReduce, which makes it better suited to data-intensive computing applications. The algorithm's performance is evaluated on an example, and the experimental results show that MR-DIDC can shorten the running time and improve accuracy in a big data environment.
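
To make the combination of MapReduce and SPRINT-style split selection concrete, the following is a minimal single-machine sketch and not the MR-DIDC implementation itself. It assumes numeric attributes, binary splits of the form "value <= threshold", and the Gini index as the split criterion (the criterion used by SPRINT). The function names `map_partition`, `reduce_counts`, and `best_split` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def gini(counts):
    """Gini impurity of a mapping from class label to count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def map_partition(records):
    """Map step (illustrative): count (attribute index, value, label) triples
    within one data partition. Each record is (features, label)."""
    out = Counter()
    for features, label in records:
        for i, v in enumerate(features):
            out[(i, v, label)] += 1
    return out


def reduce_counts(partials):
    """Reduce step (illustrative): sum the per-partition counts."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total


def best_split(counts):
    """Scan each attribute's sorted values and return (score, attribute, threshold)
    for the split 'value <= threshold' with the lowest weighted Gini index.
    Returns None when no attribute has two or more distinct values."""
    by_attr = defaultdict(lambda: defaultdict(Counter))
    for (i, v, label), c in counts.items():
        by_attr[i][v][label] += c

    best = None
    for i, values in by_attr.items():
        right = Counter()
        for hist in values.values():
            right.update(hist)
        n = sum(right.values())
        left = Counter()
        # The largest value is skipped: it would leave the right side empty.
        for v in sorted(values)[:-1]:
            left.update(values[v])
            right.subtract(values[v])
            n_left = sum(left.values())
            n_right = n - n_left
            score = (n_left * gini(left) + n_right * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, i, v)
    return best


# Two partitions stand in for data blocks processed by separate mappers.
partition_1 = [((1, 5), "a"), ((2, 5), "a")]
partition_2 = [((3, 5), "b"), ((4, 5), "b")]
merged = reduce_counts([map_partition(partition_1), map_partition(partition_2)])
split = best_split(merged)
```

In this sketch only class-count histograms leave the mappers, so the data exchanged scales with the number of distinct attribute values rather than the number of records. Whether MR-DIDC distributes the work in this way is not stated in the abstract.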
