Approximate Distributed Joins in Apache Spark

The join operation is a fundamental building block of parallel data processing. Unfortunately, computing an equi-join across massive datasets is very resource-intensive. The approximate computing paradigm allows users to trade accuracy for lower latency in expensive data processing operations, which makes the equi-join operator a natural candidate for optimization using approximation techniques. Although sampling-based approaches are widely used for approximation, sampling over joins is compelling but challenging with respect to output quality. Naive approaches, which perform joins over samples of the input datasets, do not preserve the statistical properties of the join output. To realize the potential of approximate joins, we interweave Bloom filter sketching and stratified sampling with the join computation in a new operator, ApproxJoin, that preserves the statistical properties of the join output. ApproxJoin first uses a Bloom filter to avoid shuffling non-joinable data items across the network and then applies stratified sampling to obtain a representative sample of the join output. Our analysis shows that ApproxJoin scales well and significantly reduces data movement without sacrificing tight error bounds on the accuracy of the final results. We implemented ApproxJoin in Apache Spark and evaluated it using microbenchmarks and real-world case studies. The evaluation shows that ApproxJoin achieves a speedup of 6-9x over unmodified Spark-based joins with the same sampling rate. Furthermore, the speedup is accompanied by a significant reduction in shuffled data volume, which is 5-82x smaller than with unmodified Spark-based joins.
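
The two stages described above can be illustrated with a minimal single-machine sketch in plain Python. It is not the ApproxJoin implementation and uses no Spark API. The names `BloomFilter`, `build_filter`, and `approx_join_sum`, the filter size, the choice of hash function, and the per-key sampling fraction are all assumptions made for illustration. The sketch intersects one Bloom filter per input to drop keys that cannot join, treats each remaining join key as a stratum, samples pairs from that key's cross product without replacement, and scales each stratum's sampled sum by population size over sample size to estimate the sum of `left + right` values over the full join output.

```python
import hashlib
import random
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND keeps every key present in both inputs
        # (false positives remain possible, false negatives do not).
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result


def build_filter(records):
    bf = BloomFilter()
    for key, _ in records:
        bf.add(key)
    return bf


def approx_join_sum(left, right, fraction, rng):
    """Estimate sum(l + r) over the equi-join of two (key, value) lists."""
    # Stage 1: discard records whose key cannot appear in the join output.
    join_filter = build_filter(left).intersect(build_filter(right))
    left_by_key, right_by_key = defaultdict(list), defaultdict(list)
    for key, value in left:
        if join_filter.might_contain(key):
            left_by_key[key].append(value)
    for key, value in right:
        if join_filter.might_contain(key):
            right_by_key[key].append(value)

    # Stage 2: stratified sampling, one stratum per join key.
    estimate = 0.0
    for key, left_vals in left_by_key.items():
        right_vals = right_by_key.get(key, [])
        population = len(left_vals) * len(right_vals)
        if population == 0:
            continue  # Bloom filter false positive: key is on one side only
        sample_size = min(population, max(1, round(fraction * population)))
        sampled_sum = 0
        # Sample pairs of this key's cross product without replacement.
        for idx in rng.sample(range(population), sample_size):
            i, j = divmod(idx, len(right_vals))
            sampled_sum += left_vals[i] + right_vals[j]
        # Scale the stratum's sample up to its full size.
        estimate += population / sample_size * sampled_sum
    return estimate


left = [("a", 1), ("a", 2), ("b", 10), ("c", 100)]
right = [("a", 5), ("b", 20), ("b", 30), ("d", 7)]
exact = approx_join_sum(left, right, fraction=1.0, rng=random.Random(0))
approx = approx_join_sum(left, right, fraction=0.5, rng=random.Random(0))
```

With `fraction=1.0` every pair in every stratum is drawn, so the estimate equals the exact join aggregate. Smaller fractions trade accuracy for less work per key. In a distributed setting the filtering stage is what would avoid shuffling records for keys such as `"c"` and `"d"` that appear in only one input.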
