BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud

Building and debugging distributed software remains extremely difficult. We conjecture that by adopting a data-centric approach to system design and by employing declarative programming languages, a broad range of distributed software can be recast naturally in a data-parallel programming model. Our hope is that this model can significantly raise the level of abstraction for programmers, improving code simplicity, speed of development, ease of software evolution, and program correctness. This paper presents our experience with an initial large-scale experiment in this direction. First, we used the Overlog language to implement a "Big Data" analytics stack that is API-compatible with Hadoop and HDFS and provides comparable performance. Second, we extended the system with complex distributed features not yet available in Hadoop, including high availability, scalability, and unique monitoring and debugging facilities. We present both quantitative and anecdotal results from our experience, providing some concrete evidence that both data-centric design and declarative languages can substantially simplify distributed systems programming.
