Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means
Data has long been a topic of fascination for computer scientists around the world, and it has gained even more prominence in recent times with the continuous explosion of data produced by social media and the drive of technology giants to analyze their data more deeply. This paper compares two frameworks, Hadoop MapReduce and the more recently introduced Apache Spark, both of which provide a processing model for analyzing big data. Although both frameworks target Big Data workloads, their performance varies significantly depending on the use case, which makes them worth analyzing across the varied workloads of this dynamic field. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning clustering algorithm, K-Means.
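To illustrate the kind of workload such a comparison measures, the following is a minimal Scala sketch of timing K-Means on Spark. It is not the authors' benchmark harness: the input path "points.csv", the assumption that all columns are numeric, and the choices k = 8 and 20 iterations are placeholders for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

object KMeansBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KMeansBenchmark")
      .getOrCreate()

    // Load a dataset of numeric columns; "points.csv" is a placeholder path.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("points.csv")

    // Assemble all columns into the single "features" vector KMeans expects.
    val assembler = new VectorAssembler()
      .setInputCols(raw.columns)
      .setOutputCol("features")
    val data = assembler.transform(raw).cache()
    data.count() // force caching so timing covers only the clustering step

    // Time training; k = 8 and 20 iterations are arbitrary example settings.
    val start = System.nanoTime()
    val model = new KMeans().setK(8).setMaxIter(20).setSeed(1L).fit(data)
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(s"Training took $elapsedMs ms; cost = ${model.summary.trainingCost}")
    spark.stop()
  }
}

Because Spark keeps the cached feature vectors in memory across iterations, an iterative algorithm like K-Means avoids the per-iteration disk I/O that a MapReduce implementation incurs, which is the behavior a comparison of this kind is designed to expose.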