Groupwise analytics via adaptive MapReduce

Shared-nothing systems such as Hadoop vastly simplify parallel programming when processing disk-resident data whose size exceeds aggregate cluster memory. Such systems incur a significant performance penalty, however, on the important class of “groupwise set-valued analytics” (GSVA) queries in which the data is dynamically partitioned into groups and then a set-valued synopsis is computed for some or all of the groups. Key examples of synopses include top-k sets, bottom-k sets, and uniform random samples. Applications of GSVA queries include micro-marketing, root-cause analysis for problem diagnosis, and fraud detection. A naive approach to executing GSVA queries first reshuffles all of the data so that all records in a group are at the same node and then computes the synopsis for the group. This approach can be extremely inefficient when, as is typical, only a very small fraction of the records in each group actually contribute to the final groupwise synopsis, so that most of the shuffling effort is wasted. We show how to significantly speed up GSVA queries by slightly modifying the shared-nothing environment to allow tasks to occasionally access a small, common data structure; we focus on the Hadoop setting and use the “Adaptive MapReduce” infrastructure of Vernica et al. to implement the data structure. Our approach retains most of the advantages of a system such as Hadoop while significantly improving GSVA query performance, and also allows for incremental updating of query results. Experiments show speedups of up to 5x. Importantly, our new technique can potentially be applied to other shared-nothing systems with disk-resident data.
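
To make the pruning idea concrete, here is a minimal sketch (not the paper's Hadoop/Adaptive MapReduce implementation) of a groupwise top-k GSVA query in which map tasks consult a small, common structure of per-group score thresholds and drop records that provably cannot reach the final synopsis before they are shuffled. All names (`shared_thresholds`, `map_task`, `reduce_task`, `K`) are hypothetical, and the shared structure is modeled by an ordinary in-process dictionary rather than the asynchronously updated shared state provided by situation-aware mappers.

```python
# Illustrative sketch only: groupwise top-k with a shared pruning threshold.
import heapq
from collections import defaultdict

K = 100  # synopsis size: top-k records per group

# Small, shared structure: for each group, a score already attained by K
# records somewhere in the cluster. In the real system this would live in a
# coordination layer and be refreshed only occasionally; here it is a dict.
shared_thresholds = defaultdict(lambda: float("-inf"))

def map_task(records):
    """Emit only records that can still contribute to their group's top-k.

    `records` is an iterable of (group, score, payload) tuples.
    """
    local_topk = defaultdict(list)  # group -> min-heap of the best K seen locally
    for group, score, payload in records:
        # Cheap filter: K records with strictly higher scores already exist,
        # so this record cannot enter the final synopsis; skip the shuffle.
        if score < shared_thresholds[group]:
            continue
        heap = local_topk[group]
        if len(heap) < K:
            heapq.heappush(heap, (score, payload))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, payload))
        # Occasionally publish the local k-th best score as a new threshold
        # (in the real system this update would be rare and asynchronous).
        if len(heap) == K and heap[0][0] > shared_thresholds[group]:
            shared_thresholds[group] = heap[0][0]
    # Only the surviving candidates are shuffled to the reducers.
    for group, heap in local_topk.items():
        for score, payload in heap:
            yield group, (score, payload)

def reduce_task(group, candidates):
    """Compute the exact top-k synopsis for one group from surviving candidates."""
    return heapq.nlargest(K, candidates, key=lambda c: c[0])
```

Because the filter only drops records whose scores fall strictly below a value already attained by K records in the same group, the reducers still produce exact groupwise results; the shared thresholds only shrink the volume of data that reaches the shuffle phase. Analogous filters are possible for the other synopses mentioned above, e.g., bottom-k sketches and random samples can prune on the group's current k-th smallest random tag.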

[1] Qin Zhang et al. Optimal sampling from distributed streams. PODS '10, 2010.

[2] Peter J. Haas et al. Eagle-eyed elephant: split-oriented indexing in Hadoop. EDBT '13, 2013.

[3] Edith Cohen et al. Summarizing data using bottom-k sketches. PODC '07, 2007.

[4] Chris Jermaine et al. Online aggregation for large MapReduce jobs. Proc. VLDB Endow., 2011.

[5] Pierre L'Ecuyer et al. Efficient Jump Ahead for F2-Linear Random Number Generators. INFORMS J. Comput., 2006.

[6] David P. Woodruff et al. Optimal Random Sampling from Distributed Streams Revisited. DISC, 2011.

[7] H. N. Nagaraja et al. Order Statistics, Third Edition. Wiley Series in Probability and Statistics, 2005.

[8] Andrey Balmin et al. Adaptive MapReduce using situation-aware mappers. EDBT '12, 2012.

[9] Jeffrey Scott Vitter. Random sampling with a reservoir. TOMS, 1985.

[10] Herbert A. David et al. Order Statistics. International Encyclopedia of Statistical Science, 2011.

[11] Rares Vernica et al. Hyracks: A flexible and extensible foundation for data-intensive computing. IEEE 27th International Conference on Data Engineering (ICDE), 2011.

[12] Michael J. Franklin et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI, 2012.

[13] Surajit Chaudhuri et al. Optimized stratified sampling for approximate query processing. TODS, 2007.

[14] Sanjay Ghemawat et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.

[15] Ronald L. Graham et al. Concrete Mathematics: A Foundation for Computer Science. 1991.

[16] Volker Markl et al. Spinning Fast Iterative Data Flows. Proc. VLDB Endow., 2012.

[17] Helen J. Wang et al. Online aggregation. SIGMOD '97, 1997.

[18] Surajit Chaudhuri et al. Dynamic sample selection for approximate query processing. SIGMOD '03, 2003.

[19] Pierre L'Ecuyer et al. Improved long-period generators based on linear recurrences modulo 2. TOMS, 2004.

[20] Alekh Jindal et al. Hadoop++. 2010.

[21] Peter J. Haas et al. Distinct-value synopses for multiset operations. CACM, 2009.

[22] Ion Stoica et al. Blink and It's Done: Interactive Queries on Very Large Data. Proc. VLDB Endow., 2012.

[23] Christopher Olston et al. Distributed top-k monitoring. SIGMOD '03, 2003.