The k-Nearest Neighbor Algorithm Using MapReduce Paradigm

Data in any form is a valuable resource, but data collected in the real world is often unorganized and unstructured. To realize its potential as a resource, we must transform it so that meaningful information can be retrieved from it; data mining fulfills this need. Today there is a need not only for efficient data mining techniques that can process large volumes of data, but also for a means of meeting the computational requirements of processing such volumes. In this paper we implement an effective data mining technique, the k-Nearest Neighbor method, in a distributed computing environment running Apache Hadoop, which uses the MapReduce paradigm to process high-volume data.
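To make the idea concrete, the following is a minimal single-process sketch of how k-Nearest Neighbor classification can be phrased as a map step and a reduce step. It is not the implementation described in the paper and does not use the Hadoop API; the function names (`knn_map`, `knn_reduce`, `run_knn`), the Euclidean distance metric, and the majority-vote rule are all assumptions chosen for illustration.

```python
import heapq
import math
from collections import Counter, defaultdict


def knn_map(train_split, queries, k):
    """Map step (hypothetical): for one split of the training data, emit the
    k locally nearest (distance, label) candidates for every query point."""
    for qid, q in queries.items():
        candidates = (
            (math.dist(features, q), label)  # Euclidean distance (assumed metric)
            for features, label in train_split
        )
        for pair in heapq.nsmallest(k, candidates):
            yield qid, pair


def knn_reduce(qid, candidates, k):
    """Reduce step (hypothetical): merge candidates from all splits, keep the
    k globally nearest, and predict by majority vote."""
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return qid, votes.most_common(1)[0][0]


def run_knn(splits, queries, k):
    """Simulates the shuffle phase in memory by grouping map output by key."""
    grouped = defaultdict(list)
    for split in splits:
        for qid, pair in knn_map(split, queries, k):
            grouped[qid].append(pair)
    return dict(knn_reduce(qid, cands, k) for qid, cands in grouped.items())


if __name__ == "__main__":
    # Toy data: two training splits, as if stored on two cluster nodes.
    splits = [
        [((0.0, 0.0), "a"), ((0.0, 1.0), "a")],
        [((5.0, 5.0), "b"), ((6.0, 5.0), "b"), ((1.0, 0.0), "a")],
    ]
    queries = {"q1": (0.2, 0.1), "q2": (5.5, 5.0)}
    print(run_knn(splits, queries, k=3))
```

Emitting only the k best candidates per split keeps the data sent to the reduce step small; the global k nearest neighbors are always contained in the union of the per-split top-k lists, so the result matches a single-machine search over the same data.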