Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area

We present JetStream, a system that allows real-time analysis of large, widely-distributed changing data sets. Traditional approaches to distributed analytics require users to specify in advance which data is to be backhauled to a central location for analysis. This is a poor match for domains where available bandwidth is scarce and it is infeasible to collect all potentially useful data. JetStream addresses bandwidth limits in two ways, both of which are explicit in the programming model. The system incorporates structured storage in the form of OLAP data cubes, so data can be stored for analysis near where it is generated. Using cubes, queries can aggregate data in ways and locations of their choosing. The system also includes adaptive filtering and other transformations that adjusts data quality to match available bandwidth. Many bandwidth-saving transformations are possible; we discuss which are appropriate for which data and how they can best be combined. We implemented a range of analytic queries on web request logs and image data. Queries could be expressed in a few lines of code. Using structured storage on source nodes conserved network bandwidth by allowing data to be collected only when needed to fulfill queries. Our adaptive control mechanisms are responsive enough to keep end-to-end latency within a few seconds, even when available bandwidth drops by a factor of two, and are flexible enough to express practical policies.

[1]  M. Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS '01.

[2]  Wei Hong,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation Tag: a Tiny Aggregation Service for Ad-hoc Sensor Networks , 2022 .

[3]  Mark Handley,et al.  Congestion control for high bandwidth-delay product networks , 2002, SIGCOMM '02.

[4]  Frederick Reiss,et al.  TelegraphCQ: continuous dataflow processing , 2003, SIGMOD '03.

[5]  David Maier,et al.  Exploiting Punctuation Semantics in Continuous Data Streams , 2003, IEEE Trans. Knowl. Data Eng..

[6]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[7]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[8]  Zhe Wang,et al.  Efficient top-K query calculation in distributed networks , 2004, PODC '04.

[9]  Yixin Chen,et al.  Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams , 2005, Distributed and Parallel Databases.

[10]  Ying Xing,et al.  The Design of the Borealis Stream Processing Engine , 2005, CIDR.

[11]  Yoonho Park,et al.  SPC: a distributed, scalable platform for data mining , 2006, DMSSP '06.

[12]  Margo I. Seltzer,et al.  Network-Aware Operator Placement for Stream-Processing Systems , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[13]  Jeong-Hyon Hwang,et al.  Fast and Reliable Stream Processing over Wide Area Networks , 2007, 2007 IEEE 23rd International Conference on Data Engineering Workshop.

[14]  Liang Dong,et al.  Real-Time Video Relay for UAV Traffic Surveillance Systems Through Available Communication Networks , 2007, 2007 IEEE Wireless Communications and Networking Conference.

[15]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[16]  Songqing Chen,et al.  The stretched exponential distribution of internet media access patterns , 2008, PODC '08.

[17]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[18]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[19]  Michael J. Franklin,et al.  Continuous Analytics: Rethinking Query Processing in a Network-Effect World , 2009, CIDR.

[20]  Ramesh Govindan,et al.  RCRT: Rate-controlled reliable transport protocol for wireless sensor networks , 2010, TOSN.

[21]  Michael J. Freedman,et al.  Experiences with CoralCDN: A Five-Year Operational View , 2010, NSDI.

[22]  Mahadev Konar,et al.  ZooKeeper: Wait-free Coordination for Internet-scale Systems , 2010, USENIX Annual Technical Conference.

[23]  VICCI: A Programmable Cloud-Computing Research Testbed , 2011 .

[24]  Scott Shenker,et al.  Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters , 2012, HotCloud.

[25]  Ion Stoica,et al.  BlinkDB: queries with bounded errors and bounded response times on very large data , 2012, EuroSys '13.

[26]  P. Agarwal,et al.  Mergeable summaries , 2013, TODS.

[27]  Michael J. Freedman,et al.  Making Every Bit Count in Wide-Area Analytics , 2013, HotOS.

[28]  Daniel Mills,et al.  MillWheel: Fault-Tolerant Stream Processing at Internet Scale , 2013, Proc. VLDB Endow..

[29]  Michael Isard,et al.  Differential Dataflow , 2013, CIDR.