Thread Cooperation in Multicore Architectures for Frequency Counting over Multiple Data Streams

Many real-world data stream analysis applications, such as network monitoring and click stream analysis, require combining multiple streams of data arriving from multiple sources. This is referred to as multi-stream analysis. To cope with high stream arrival rates, such systems must support very high processing throughput. The advent of multicore processors, and of powerful servers built on them, calls for parallel designs that use the available cores effectively, since further performance improvement is possible only through parallelism. In this paper, we address the problem of parallelizing multi-stream analysis on multicore processors. Specifically, we concentrate on parallelizing frequent elements, top-k, and frequency counting queries over multiple streams, and we discuss the challenges in designing an efficient parallel system for multi-stream processing. Our evaluation and analysis reveal that traditional "contention" based locking incurs excessive overhead and waiting, which in turn leads to severe performance degradation on modern multicore architectures. Based on this analysis, we propose a "cooperation" based locking paradigm for efficient parallelization of frequency counting. The proposed paradigm removes the waits associated with synchronization and allows locks to be replaced by much cheaper atomic synchronization primitives. We apply the paradigm to parallelize a well-known frequency counting algorithm and compare the result against a traditional "contention" based design. In our experiments, the "cooperation" based design outperforms the "contention" based design by a factor of 2 to 5.5 on synthetic Zipfian data sets.
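To make the contrast between waiting and cooperating concrete, the following is a minimal sketch in Python and not the paper's implementation. It assumes a single shared summary, uses a non-blocking `Lock.acquire` as a stand-in for an atomic test-and-set, and relies on CPython's thread-safe `deque.append` and `deque.popleft`. The names `CooperativeCounter` and `count_streams` are hypothetical. A thread that finds the summary busy does not wait: it leaves its update in a shared queue, and whichever thread currently owns the summary applies it.

```python
import threading
from collections import Counter, deque


class CooperativeCounter:
    """Illustrative frequency counter in which no thread ever blocks.

    Assumption: one global owner flag; a real design would likely
    cooperate at a finer granularity (for example, per element).
    """

    def __init__(self):
        self.counts = Counter()
        self._pending = deque()         # updates handed over by busy threads
        self._owner = threading.Lock()  # only ever acquired non-blockingly

    def add(self, item):
        # Publish the update first, then try to become the owner.
        self._pending.append(item)
        # If the try-lock fails, another thread owns the summary and will
        # apply this update for us, so we return without waiting.
        while self._pending and self._owner.acquire(blocking=False):
            try:
                while True:
                    try:
                        key = self._pending.popleft()
                    except IndexError:
                        break  # queue drained
                    self.counts[key] += 1
            finally:
                self._owner.release()
            # The outer loop re-checks the queue, so an update published
            # just before the release above is still picked up.


def count_streams(streams):
    """Count item frequencies over several streams, one thread per stream."""
    counter = CooperativeCounter()

    def consume(stream):
        for item in stream:
            counter.add(item)

    workers = [threading.Thread(target=consume, args=(s,)) for s in streams]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.counts
```

Because every update is published before the try-lock, and the owner re-checks the queue after releasing, no update is lost even though no thread ever waits for another. The sketch only illustrates the delegation idea; it says nothing about the throughput figures reported above.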
