Minimum Spanning Tree Based Classification Model for Massive Data with MapReduce Implementation

Rapid growth of data has provided us with more information, yet it challenges traditional techniques for extracting useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor (KNN) method and cluster-based classification methods, aiming to overcome their disadvantages and cope with large amounts of data. Our model is implemented on the Hadoop platform using its MapReduce programming framework, which is particularly suitable for cloud computing. We ran experiments on several data sets, including real-world data from the UCI repository and synthetic data, using Downing 4000 clusters with Hadoop installed. The results show that our model generally outperforms KNN and some other classification methods in both accuracy and scalability.
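
The abstract does not spell out the MCMM algorithm itself, so the following is only an illustrative sketch of how an MST-based classifier can sit between KNN (which compares a query against every training point) and cluster-based classification (which compares it against a single cluster summary). It assumes Euclidean distance, builds one MST per class with Prim's algorithm, and assigns a query to the class whose tree has the closest edge. The names `prim_mst`, `point_to_segment`, `fit`, and `predict` are hypothetical, and the decision rule is an assumption made for illustration, not the authors' method.

```python
import math


def prim_mst(points):
    """Return MST edges (i, j) over `points` using O(n^2) Prim's algorithm."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known cost to attach each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest point not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Relax attachment costs of the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def point_to_segment(p, a, b):
    """Euclidean distance from point p to the line segment ab."""
    ab = [y - x for x, y in zip(a, b)]
    ap = [y - x for x, y in zip(a, p)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else sum(x * y for x, y in zip(ab, ap)) / denom
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    proj = [x + t * c for x, c in zip(a, ab)]
    return math.dist(p, proj)


def fit(samples):
    """samples: list of (point, label). Returns {label: (points, mst_edges)}."""
    by_label = {}
    for point, label in samples:
        by_label.setdefault(label, []).append(point)
    return {label: (pts, prim_mst(pts)) for label, pts in by_label.items()}


def predict(model, query):
    """Assumed rule: choose the class whose MST has the edge nearest to `query`."""
    def class_distance(pts, edges):
        if not edges:  # a class with a single training point has no edges
            return math.dist(query, pts[0])
        return min(point_to_segment(query, pts[i], pts[j]) for i, j in edges)

    return min(model, key=lambda label: class_distance(*model[label]))


train = [((0, 0), "a"), ((1, 0), "a"), ((2, 0), "a"), ((0, 5), "b"), ((1, 5), "b")]
model = fit(train)
print(predict(model, (1.5, 0.5)))
```

In this sketch each class tree is built independently of the others, so the per-class construction could in principle be assigned to separate reduce tasks keyed by label; whether MCMM partitions the work this way is not stated in the abstract.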
