LOOM: Optimal Aggregation Overlays for In-Memory Big Data Processing

Aggregation underlies the distillation of information from big data. Many well-known basic operations, including top-k matching and word count, hinge on fast aggregation across large datasets. Common frameworks such as MapReduce support aggregation but do not explicitly consider or optimize it. Optimizing aggregation, however, becomes even more relevant in recent "online" approaches to expressive big data analysis, which store data in main memory across nodes. This shifts the bottlenecks from disk I/O to distributed computation and network communication, significantly increasing the impact of aggregation time on total job completion time. This paper presents LOOM, a (sub)system for efficient big data aggregation for use within big data analysis frameworks. LOOM efficiently supports two-phased (sub)computations consisting of a first phase performed on individual data sub-sets (e.g., word count, top-k matching) followed by a second aggregation phase which consolidates the individual results of the first phase (e.g., count sum, top-k). Using characteristics of an aggregation function, LOOM constructs a specifically configured aggregation overlay that minimizes aggregation costs. We present optimality heuristics and experimentally demonstrate the benefits of optimizing aggregation overlays in this manner, using microbenchmarks and real-world examples.
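
As an illustration of the two-phase pattern the abstract describes, the following Python sketch runs a first phase (word count) on each data sub-set and then consolidates the partial results pairwise, level by level, the way an aggregation overlay tree would. This is a minimal, hypothetical sketch of the general pattern, not LOOM's interface; the names first_phase, aggregate, and tree_aggregate are illustrative assumptions.

# Minimal sketch of a two-phase aggregation (word count + count sum),
# with the second phase merged along a binary tree to mimic an overlay.
from collections import Counter
from functools import reduce

def first_phase(partition):
    """Phase 1: compute a partial result (word counts) on one data sub-set."""
    return Counter(word for line in partition for word in line.split())

def aggregate(left, right):
    """Phase 2: consolidate two partial results (count sum)."""
    return left + right

def tree_aggregate(partials):
    """Merge partial results pairwise, level by level, as an overlay tree would."""
    while len(partials) > 1:
        partials = [reduce(aggregate, partials[i:i + 2])
                    for i in range(0, len(partials), 2)]
    return partials[0]

if __name__ == "__main__":
    partitions = [
        ["the quick brown fox", "jumps over the lazy dog"],
        ["the dog barks", "the fox runs"],
        ["lazy summer days", "quick summer storms"],
    ]
    partials = [first_phase(p) for p in partitions]   # phase 1, per sub-set
    total = tree_aggregate(partials)                  # phase 2, overlay merge
    print(total.most_common(3))                       # e.g., top-k by count

The shape of the merge tree (fan-in, depth, placement) is what an overlay construction such as LOOM's would tune based on the aggregation function's characteristics; the flat binary tree above is only the simplest instance.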
