Revisiting the Design of Data Stream Processing Systems on Multi-Core Processors

Driven by the rapidly increasing demand for handling real-time data streams, many data stream processing (DSP) systems have been proposed. Regardless of their different architectures, these DSP systems mostly aim to scale out on a cluster of commodity machines and are built around a number of key design aspects: a) pipelined processing with message passing, b) on-demand data parallelism, and c) JVM-based implementation. However, these key design aspects have not been studied on modern scale-up architectures, where more CPU cores are being put on the same die and the on-chip cache hierarchies are getting larger, deeper, and more complex. Multiple sockets additionally introduce non-uniform memory access (NUMA) effects. In this paper, we revisit the aforementioned design aspects on a modern scale-up server. Specifically, we use a series of applications as micro-benchmarks to conduct detailed profiling studies on Apache Storm and Flink. From the profiling results, we observe two major performance issues: a) the massively parallel execution model causes serious front-end stalls, which are a major performance bottleneck on a single CPU socket, and b) the lack of NUMA-aware mechanisms severely limits the scalability of DSP systems on multi-socket architectures. Addressing these issues should allow DSP systems to exploit modern scale-up architectures, which also benefits scale-out environments. We present our initial efforts to resolve these performance issues, which have shown up to 3.2x and 3.1x performance improvements on Storm and Flink, respectively.
