Multi-Query Optimization in Wide-Area Streaming Analytics

Wide-area data analytics has gained much attention in recent years due to the increasing need to analyze geographically distributed data. Many such queries require real-time analysis of data streams that are continuously generated across multiple locations. Yet analyzing these geo-distributed data streams in a timely manner is challenging because wide-area network (WAN) bandwidth is limited and highly heterogeneous. This paper examines the opportunity of applying multi-query optimization to wide-area streaming analytics, with the goal of using WAN bandwidth efficiently while achieving high-throughput, low-latency execution. Our approach is based on the insight that streaming analytics queries often exhibit common executions, whether in consuming a common set of input data or in performing the same data processing. In this work, we study different types of sharing opportunities and propose a practical online algorithm that allows streaming analytics queries to share their common executions incrementally. We further address the importance of WAN awareness in applying multi-query optimization: without it, sharing executions in a wide-area environment may degrade performance. We have implemented our WAN-aware multi-query optimization in a prototype based on Apache Flink. Experimental evaluation using Twitter traces on a real wide-area deployment across geo-distributed EC2 data centers shows that our technique achieves 21% higher throughput while reducing WAN bandwidth consumption by 33% compared to a WAN-aware, sharing-agnostic system.
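
To make the idea of incrementally sharing common executions concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes each query is a linear chain of operators identified by strings, and that two queries can share an operator only when their chains are identical up to that point. The names `SharedPlan`, `Node`, and `add_query`, as well as the operator labels, are hypothetical. WAN awareness (deciding whether sharing is actually beneficial given link bandwidth) is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operator in the shared plan (hypothetical structure)."""
    op: str
    children: dict = field(default_factory=dict)  # operator label -> Node
    queries: set = field(default_factory=set)     # ids of queries using this operator


class SharedPlan:
    """Prefix tree of operator chains; queries are merged in one at a time."""

    def __init__(self):
        self.root = Node("root")
        self.operator_count = 0  # operators actually deployed

    def add_query(self, query_id, ops):
        """Insert a query online and return how many operators it reused."""
        node = self.root
        reused = 0
        for op in ops:
            child = node.children.get(op)
            if child is None:
                # No existing execution to share: deploy a new operator.
                child = Node(op)
                node.children[op] = child
                self.operator_count += 1
            else:
                # Common input or common processing: share the existing operator.
                reused += 1
            child.queries.add(query_id)
            node = child
        return reused


plan = SharedPlan()
r1 = plan.add_query("q1", ["source:tweets@us-east", "filter:lang=en", "count:hashtag"])
r2 = plan.add_query("q2", ["source:tweets@us-east", "filter:lang=en", "topk:user"])
print(r1, r2, plan.operator_count)
```

In this sketch the second query reuses the source and the filter already deployed for the first, so only four operators run instead of six. A WAN-aware variant would additionally need to check, before sharing, whether routing both queries through the shared operator's site costs more WAN bandwidth than running them separately.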
