Tupleware: Distributed Machine Learning on Small Clusters

There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the challenges of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to several terabytes in size, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes our vision for the design of Tupleware, a new system specifically aimed at performing complex analytics (e.g., distributed machine learning) on small clusters. Tupleware's architecture brings together ideas from the database and compiler communities to create a powerful end-to-end solution for data analysis. Our preliminary results show orders of magnitude performance improvement over alternative systems.
