DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing Systems

Systems enabling the continuous processing of large data streams have recently attracted the attention of the scientific community and industrial stakeholders. Data Stream Processing Systems (DSPSs) are complex and powerful frameworks that ease the development of streaming applications in distributed computing environments such as clusters and clouds. Several systems of this kind have been released and are currently maintained as open source projects, such as Apache Storm and Spark Streaming. The scientific community has often used a small set of benchmark applications to test and evaluate new techniques for improving the performance and usability of DSPSs. However, the existing benchmark suites lack representative workloads from the wide set of application domains that can benefit from the near real-time performance offered by the stream processing paradigm. The goal of this article is to present a new benchmark suite composed of 15 applications from areas such as Finance, Telecommunications, Sensor Networks, Social Networks and others. This article describes in detail the nature of these applications and their full workload characterization in terms of selectivity, processing cost, input size and overall memory occupation. In addition, it demonstrates the usefulness of the benchmark suite for comparing real DSPSs, using Apache Storm and Spark Streaming for this analysis.
