Paper information - Revisiting the Design of Data Stream Processing Systems on Multi-Core Processors

Driven by the rapidly increasing demand for handling real-time data streams, many data stream processing (DSP) systems have been proposed. Regardless of their different architectures, these DSP systems mostly aim at scaling out across a cluster of commodity machines and are built around a number of key design aspects: a) pipelined processing with message passing, b) on-demand data parallelism, and c) JVM-based implementation. However, a study of those key design aspects on modern scale-up architectures is lacking. On such architectures, more CPU cores are being put on the same die, the on-chip cache hierarchies are getting larger, deeper, and more complex, and multiple sockets bring non-uniform memory access (NUMA) effects. In this paper, we revisit the aforementioned design aspects on a modern scale-up server. Specifically, we use a series of applications as micro-benchmarks to conduct detailed profiling studies on Apache Storm and Flink. From the profiling results, we observe two major performance issues: a) the massively parallel execution model causes serious front-end stalls, which are a major performance bottleneck on a single CPU socket, and b) the lack of NUMA-aware mechanisms severely limits the scalability of DSP systems on multi-socket architectures. Addressing these issues should allow DSP systems to exploit modern scale-up architectures, which also benefits scale-out environments. We present our initial efforts on resolving these performance issues, which have shown up to 3.2x and 3.1x improvement in the performance of Storm and Flink, respectively.

[1] Leonardo Neumeyer,et al. S4: Distributed Stream Computing Platform , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[2] Peter Lake,et al. In-Memory Databases , 2013 .

[3] Beng Chin Ooi,et al. In-memory Databases: Challenges and Opportunities From Software and Hardware Perspectives , 2015, SGMD.

[4] Gustavo Alonso,et al. Deployment of Query Plans on Multicores , 2014, Proc. VLDB Endow..

[5] Jignesh M. Patel,et al. Profiling R on a Contemporary Processor , 2014, Proc. VLDB Endow..

[6] Jim Gray,et al. Benchmark Handbook: For Database and Transaction Processing Systems , 1992 .

[7] Giuseppe Bianchi,et al. On-demand time-decaying bloom filters for telemarketer detection , 2011, CCRV.

[8] Gustavo Alonso,et al. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware , 2012, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[9] Vladimir Vlassov,et al. Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server , 2015, 2015 IEEE Fifth International Conference on Big Data and Cloud Computing.

[10] Kenneth A. Ross,et al. Buffering database operations for enhanced instruction cache performance , 2004, SIGMOD '04.

[11] Dorit S. Hochbaum,et al. A Polynomial Algorithm for the k-cut Problem for Fixed k , 1994, Math. Oper. Res..

[12] Mohammad Hosseini,et al. R-Storm: Resource-Aware Scheduling in Storm , 2015, Middleware.

[13] Shaiful Alam Chowdhury,et al. Performance Evaluation of Yahoo! S4: A First Look , 2012, 2012 Seventh International Conference on P2P, Parallel, Grid, Cloud and Internet Computing.

[14] Michael Stonebraker,et al. Linear Road: A Stream Data Management Benchmark , 2004, VLDB.

[15] Ippokratis Pandis,et al. NUMA-aware algorithms: the case of data shuffling , 2013, CIDR.

[16] R. Srikant,et al. Scheduling Storms and Streams in the Cloud , 2015, SIGMETRICS.

[17] Martin L. Kersten,et al. Database Architecture Optimized for the New Bottleneck: Memory Access , 1999, VLDB.

[18] Frederick Reiss,et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World , 2003, CIDR.

[19] Xing Xie,et al. Mining interesting locations and travel sequences from GPS trajectories , 2009, WWW '09.

[20] David Detlefs,et al. Garbage-first garbage collection , 2004, ISMM '04.

[21] Michael Stonebraker,et al. Aurora: a new model and architecture for data stream management , 2003, The VLDB Journal.

[22] Roberto Baldoni,et al. Adaptive online scheduling in storm , 2013, DEBS.

[23] Anastasia Ailamaki,et al. Improving instruction cache performance in OLTP , 2006, TODS.

[24] Malu Castellanos,et al. Building a Transparent Batching Layer for Storm , 2014 .

[25] Beng Chin Ooi,et al. In-Memory Big Data Management and Processing: A Survey , 2015, IEEE Transactions on Knowledge and Data Engineering.

[26] Navendu Jain,et al. Design, implementation, and evaluation of the linear road benchmark on the stream processing core , 2006, SIGMOD Conference.

[27] Bingsheng He,et al. Cache-Conscious Automata for XML Filtering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[28] Anastasia Ailamaki,et al. A Case for Staged Database Systems , 2003, CIDR.

[29] Lieven Eeckhout,et al. Performance Evaluation and Benchmarking , 2005 .

[30] Scott Shenker,et al. Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[31] Ying Xing,et al. The Design of the Borealis Stream Processing Engine , 2005, CIDR.

[32] Frederick Reiss,et al. TelegraphCQ: continuous dataflow processing , 2003, SIGMOD '03.

[33] Qiang Chen,et al. Aurora: a new model and architecture for data stream management , 2006 .

[34] David J. DeWitt,et al. DBMSs on a Modern Processor: Where Does Time Go? , 1999, VLDB.

[35] Jian Tang,et al. T-Storm: Traffic-Aware Online Scheduling in Storm , 2014, 2014 IEEE 34th International Conference on Distributed Computing Systems.

[36] Viktor Leis,et al. Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age , 2014, SIGMOD Conference.