Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics

Many applications must ingest rapid data streams and produce analytics results in near-real-time. It is increasingly common for inputs to such applications to originate from geographically distributed sources. The typical infrastructure for processing such geo-distributed streams follows a hub-and-spoke model, where several edge servers perform partial computation before forwarding results over a wide-area network (WAN) to a central location for final processing. Because WAN bandwidth is limited, it is not always possible to produce exact results on time. In such cases, applications must either sacrifice timeliness by allowing delayed (i.e., stale) results, or sacrifice accuracy by allowing some error in final results. In this paper, we focus on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and we study the tradeoff between staleness and error. We present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy under bandwidth limitations. Using a workload derived from an analytics service offered by a large commercial CDN, we demonstrate the effectiveness of our techniques through both trace-driven simulation and experiments on an Apache Storm-based implementation deployed on PlanetLab. Our experiments show that our proposed algorithms reduce staleness by 81.8% to 96.6% and error by 83.4% to 99.1% compared with a practical random sampling/batching-based aggregation algorithm across a diverse set of aggregation functions.
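To make the tradeoff concrete, the following is a minimal sketch of windowed grouped aggregation at a single edge server, written under assumptions that do not come from the paper: records are `(timestamp, key, value)` tuples with non-negative values, windows are tumbling, the aggregate is a per-key sum, and bandwidth is modeled as a count of per-key updates per time tick. The function names (`edge_partial_aggregate`, `send_under_budget`, `ticks_to_send_all`) and the largest-first selection rule are hypothetical illustrations, not the offline or online algorithms the paper presents.

```python
from collections import defaultdict


def edge_partial_aggregate(records, window_size):
    """Group (timestamp, key, value) records into tumbling windows
    and compute a per-key partial sum inside each window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key, value in records:
        windows[ts // window_size][key] += value
    return {w: dict(groups) for w, groups in windows.items()}


def send_under_budget(groups, budget):
    """Accuracy-sacrificing option (illustrative heuristic only):
    forward at most `budget` per-key sums, largest first, and report
    the fraction of the window total that was left out."""
    ranked = sorted(groups.items(), key=lambda kv: (-kv[1], kv[0]))
    sent = dict(ranked[:budget])
    dropped = sum(v for _, v in ranked[budget:])
    total = sum(groups.values())
    error = dropped / total if total else 0.0
    return sent, error


def ticks_to_send_all(groups, budget_per_tick):
    """Timeliness-sacrificing option: number of ticks needed to forward
    every per-key sum exactly, i.e. ceil(len(groups) / budget_per_tick)."""
    return -(-len(groups) // budget_per_tick)


if __name__ == "__main__":
    records = [(0, "a", 5), (1, "b", 1), (2, "a", 3), (3, "c", 2), (12, "a", 4)]
    per_window = edge_partial_aggregate(records, window_size=10)
    first = per_window[0]
    print("partial sums for window 0:", first)
    print("approximate send:", send_under_budget(first, budget=2))
    print("ticks for exact send:", ticks_to_send_all(first, budget_per_tick=2))
```

In this toy model, a window with three distinct keys and a budget of two updates per tick can either be delivered exactly after two ticks or delivered after one tick with the smallest key's contribution missing. The paper's contribution concerns how to choose between such options well; this sketch only illustrates the two ends of that choice.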
