Bohr: similarity aware geo-distributed data analytics

We propose Bohr, a similarity-aware geo-distributed data analytics system that minimizes query completion time (QCT). The key idea is to exploit similarity between data in different data centers (DCs) and to transfer similar data from the bottleneck DC to other sites with more WAN bandwidth. Although these sites then have more input data to process, the data are more similar, so the combiner can aggregate them more efficiently and reduce the intermediate data that must be shuffled across the WAN. Our similarity-aware approach thus reduces the shuffle time and, in turn, the QCT. We design Bohr around OLAP data cubes to perform efficient similarity checking among datasets at different sites. We implement Bohr on Spark and deploy it across ten sites of AWS EC2. Our extensive evaluation using realistic query workloads shows that Bohr improves the QCT by up to 50% and reduces the intermediate data by up to 6x compared to state-of-the-art solutions that also use OLAP cubes.
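The effect of similarity on combiner output can be illustrated with a toy sketch. This is not Bohr's algorithm or its OLAP-cube similarity check; the two sites, their records, and the helper names `combine` and `shuffle_size` are hypothetical, and "similar" is simplified here to "shares a key with data already at the destination site".

```python
from collections import Counter

def combine(records):
    """Map-side combiner: pre-aggregate (key, value) pairs by key."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)

def shuffle_size(sites):
    """Number of combined records that all sites would shuffle over the WAN."""
    return sum(len(combine(records)) for records in sites.values())

# Hypothetical toy input: site A is assumed to be the bottleneck DC.
site_a = [("x", 1), ("y", 1), ("p", 1), ("q", 1)]
site_b = [("p", 1), ("q", 1), ("r", 1)]

# No data movement: each site combines only its own records.
baseline = shuffle_size({"A": site_a, "B": site_b})

# Split A's records by whether their key already exists at site B.
b_keys = {key for key, _ in site_b}
similar_part = [rec for rec in site_a if rec[0] in b_keys]
dissimilar_part = [rec for rec in site_a if rec[0] not in b_keys]

# Similarity-aware move: ship the records that B's combiner can merge.
similar = shuffle_size({"A": dissimilar_part, "B": site_b + similar_part})

# Similarity-agnostic move of the same volume: ship the other records.
dissimilar = shuffle_size({"A": similar_part, "B": site_b + dissimilar_part})

print(baseline, similar, dissimilar)
```

In this toy setting, moving the records whose keys already exist at site B lowers the total number of combined records, while moving the same number of unrelated records leaves it unchanged.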
