Stream Bench: Towards Benchmarking Modern Distributed Stream Computing Frameworks

While big data is becoming ubiquitous, interest in handling data stream at scale is also gaining popularity, which leads to the sprout of many distributed stream computing systems. However, complexity of stream computing and diversity of workloads expose great challenges to benchmark these systems. Due to lack of standard criteria, evaluations and comparisons of these systems tend to be difficult. This paper takes an early step towards benchmarking modern distributed stream computing frameworks. After identifying the challenges and requirements in the field, we raise our benchmark definition Stream Bench regarding the requirements. Stream Bench proposes a message system functioning as a mediator between stream data generation and consumption. It also covers 7 benchmark programs that intend to address typical stream computing scenarios and core operations. Not only does it care about performance of systems under different data scales, but also takes fault tolerance ability and durability into account, which drives to incorporate four workload suites targeting at these various aspects of systems. Finally, we illustrate the feasibility of Stream Bench by applying it to two popular frameworks, Apache Storm and Apache Spark Streaming. We draw comparisons from various perspectives between the two platforms with workload suites of Stream Bench. In addition, we also demonstrate performance improvement of Storm's latest version with the benchmark.

[1]  Lukasz Golab,et al.  Issues in data stream management , 2003, SGMD.

[2]  Dawn Xiaodong Song,et al.  Design and Evaluation of a Real-Time URL Spam Filtering Service , 2011, 2011 IEEE Symposium on Security and Privacy.

[3]  Michael Stonebraker,et al.  Linear Road: A Stream Data Management Benchmark , 2004, VLDB.

[4]  Jie Huang,et al.  The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[5]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[6]  Scott Shenker,et al.  Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters , 2012, HotCloud.

[7]  Leonardo Neumeyer,et al.  S4: Distributed Stream Computing Platform , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[8]  Zhengping Qian,et al.  TimeStream: reliable stream computation in the cloud , 2013, EuroSys '13.

[9]  Qiang Chen,et al.  Aurora : a new model and architecture for data stream management ) , 2006 .

[10]  Xiaona Li,et al.  BigDataBench: a Big Data Benchmark Suite from Web Search Engines , 2013, ArXiv.

[11]  Jay Kreps,et al.  Kafka : a Distributed Messaging System for Log Processing , 2011 .

[12]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[13]  Din J. Wasem,et al.  Mining of Massive Datasets , 2014 .

[14]  Michael Stonebraker,et al.  The 8 requirements of real-time stream processing , 2005, SGMD.

[15]  Michael Stonebraker,et al.  The Aurora and Medusa Projects , 2003, IEEE Data Eng. Bull..