No. 2013:11 Concurrent Data Structures for Efficient Streaming Aggregation

In this paper, we study the problem of streaming multiway aggregation, in which large data volumes are received from multiple input streams. Multiway aggregation is a fundamental computational component in data stream management systems and requires low-latency, high-throughput solutions. We focus on designing concurrent data structures that achieve low-latency, high-throughput multiway aggregation, an issue that has been overlooked in the literature. Data structures act as articulation points between the pipeline stages of streaming aggregation, supporting parallelism and balancing the work between the aggregate stages. By studying the use and limitations of existing aggregate designs and the data structures they use, we motivate the need for new ones. We present two new concurrent data structures and their lock-free, linearizable implementations, which support both order-sensitive and order-insensitive aggregate functions. We also present an extensive evaluation based on a variety of aggregation queries over two large datasets: one with data extracted from SoundCloud, a music social network, and one with data from a smart grid metering network. In all experiments, the proposed data structures and methods improved aggregation performance significantly, by up to one order of magnitude in both processing throughput and latency, over the commonly used queue-based techniques.
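To make the distinction between order-sensitive and order-insensitive aggregate functions concrete, the following is a minimal sequential sketch. It is not the lock-free data structures proposed in the paper. It assumes tuples of the form (timestamp, key, value), timestamp-sorted input streams, and tumbling windows; the names merge_streams and windowed_aggregate are hypothetical and chosen only for illustration.

```python
import heapq


def merge_streams(*streams):
    """Merge timestamp-sorted streams into one timestamp-sorted sequence.

    Each stream is an iterable of (timestamp, key, value) tuples
    (an assumed layout, not taken from the paper).
    """
    return heapq.merge(*streams, key=lambda t: t[0])


def windowed_aggregate(tuples, window_size, init, step):
    """Fold tuples into per-(window, key) state using tumbling windows.

    `step(acc, value)` is applied in arrival order, so the result of an
    order-sensitive function depends on how the inputs were merged.
    """
    state = {}
    for ts, key, value in tuples:
        slot = (ts // window_size, key)
        state[slot] = step(state.get(slot, init), value)
    return state


# Two hypothetical input streams, each sorted by timestamp.
stream_a = [(1, "k", 10), (4, "k", 30), (12, "k", 5)]
stream_b = [(2, "k", 20), (11, "k", 7)]


def add(acc, value):
    # Order-insensitive: a sum is the same for any arrival order.
    return acc + value


def last(acc, value):
    # Order-sensitive: keeps the most recently seen value.
    return value


sums = windowed_aggregate(merge_streams(stream_a, stream_b), 10, 0, add)
last_merged = windowed_aggregate(merge_streams(stream_a, stream_b), 10, None, last)
last_unmerged = windowed_aggregate(stream_a + stream_b, 10, None, last)

print(sums)           # {(0, 'k'): 60, (1, 'k'): 12}
print(last_merged)    # {(0, 'k'): 30, (1, 'k'): 5}
print(last_unmerged)  # {(0, 'k'): 20, (1, 'k'): 7}
```

The sum is unaffected by whether the streams are merged in timestamp order, while the "last value" aggregate changes when the streams are simply concatenated. This is why an order-sensitive function needs a deterministic merge of its inputs before aggregation, which is the step a queue-based or concurrent shared structure has to provide between pipeline stages.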
