Getting Started with Spark

Cluster computing has seen the rise of a new class of popular computing models in which clusters execute data-parallel computations on unreliable machines, enabled by software systems that provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce [13] pioneered this model, while systems such as Map-Reduce-Merge [15] and Dryad [11] generalized it to other types of data flows. These systems are scalable and fault tolerant because their programming model lets users build acyclic data flow graphs that pass input data through a set of operations. Because the data flow is fixed in advance, the system can schedule work and recover from faults without user intervention. While this model suits many applications, there are problems that acyclic data flows cannot solve efficiently, most notably iterative computations that reuse a working set of data across many steps, which is the motivation behind Spark [7].
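
As a concrete illustration, the sketch below is written against the current Spark Scala API (SparkConf, SparkContext, textFile, filter, cache, count); the object name and the HDFS path are placeholders chosen for this example, not taken from the Spark paper. It shows a simple acyclic flow over a log file and then reuses the cached working set for a second query, the kind of reuse that purely acyclic, disk-based data flows handle poorly.

    import org.apache.spark.{SparkConf, SparkContext}

    object GettingStarted {
      def main(args: Array[String]): Unit = {
        // Local mode for experimentation; on a cluster the master URL would differ.
        val conf = new SparkConf().setAppName("GettingStarted").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Acyclic data flow: read input, filter it, count the results.
        val lines  = sc.textFile("hdfs://...")        // placeholder input path
        val errors = lines.filter(_.contains("ERROR"))
        errors.cache()                                // keep the working set in memory

        // Reusing the cached dataset avoids rereading the input for each query,
        // the case where purely acyclic, disk-based data flows are inefficient.
        val totalErrors = errors.count()
        val timeouts    = errors.filter(_.contains("timeout")).count()

        println(s"errors=$totalErrors, timeouts=$timeouts")
        sc.stop()
      }
    }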

[1] Aart J. C. Bik et al. Pregel: a system for large-scale graph processing, 2010, SIGMOD Conference.

[2] Michael D. Ernst et al. HaLoop, 2010, Proc. VLDB Endow.

[3] Ravi Kumar et al. Pig Latin: a not-so-foreign language for data processing, 2008, SIGMOD Conference.

[4] Miguel Castro et al. Safe and efficient sharing of persistent objects in Thor, 1996, SIGMOD '96.

[5] Jinyang Li et al. Piccolo: Building Fast, Distributed Programs with Partitioned Tables, 2010, OSDI.

[6] Scott Shenker et al. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling, 2010, EuroSys '10.

[7] Scott Shenker et al. Spark: Cluster Computing with Working Sets, 2010, HotCloud.

[8] James Frew et al. Lineage retrieval for scientific data processing: a survey, 2005, CSUR.

[9] Willy Zwaenepoel et al. Implementation and performance of Munin, 1991, SOSP '91.

[10] Kai Li et al. IVY: A Shared Virtual Memory System for Parallel Computing, 1988, ICPP.

[11] Yuan Yu et al. Dryad: distributed data-parallel programs from sequential building blocks, 2007, EuroSys '07.

[12] David Gelernter et al. Generative communication in Linda, 1985, TOPL.

[13] Sanjay Ghemawat et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[14] Randy H. Katz et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, 2011, NSDI.

[15] Douglas Stott Parker et al. Map-reduce-merge: simplified relational data processing on large clusters, 2007, SIGMOD '07.

[16] Bill Nitzberg et al. Distributed shared memory: a survey of issues and algorithms, 1991, Computer.

[17] Geoffrey C. Fox et al. Twister: a runtime for iterative MapReduce, 2010, HPDC '10.

[18] Anne-Marie Kermarrec et al. A recoverable distributed shared memory integrating coherence and recoverability, 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, Digest of Papers.

[19] Benjamin Hindman et al. A Common Substrate for Cluster Computing, 2009, HotCloud.