Frugal topology construction for stream aggregation in the cloud

Aggregation of streamed data is key to the expansion of the Internet of Things. This paper addresses the problem of designing a topology that reliably aggregates data flows arriving at a datacenter from many devices. Reliability here means operating without data loss. We seek a frugal solution, one that avoids wasteful resource consumption (over-provisioning). The problem is salient when an aggregation service is built out of components (here, aggregation nodes) that impose hard constraints on the amount of information they can handle per unit of time. We first formalize the problem and analyze, in terms of data rates, the relation between the monitored devices (and the information they send) and the operations performed at aggregation nodes. Building on this rate analysis, we devise a novel algorithm, which we call CSA, that outputs an aggregation topology capable of handling those incoming data rates, thereby avoiding empirical trial-and-error design. We analyze the algorithm and then validate it on the Amazon Kinesis platform, using a device dataset from a European telco operator.
