The past few years have witnessed an unparalleled surge in both structured and unstructured data generated by heterogeneous sources, ranging from scientific computations and sensor network deployments to high-frequency financial markets and Web 2.0 applications. This surge has in turn engendered an entire ecosystem of high-level computation frameworks targeting batch, streaming, iterative, and incremental applications, which abstract away the details of distributed computation behind simple APIs. In a similar vein, the need to store datasets pre- and post-analysis has led to innovation in traditional DBMSs, NoSQL stores, and distributed file systems. On the infrastructure side, owing to economies of scale, this computation and storage model is supported by commodity off-the-shelf hardware. A key goal of organizations that crunch these datasets is to get timely results with sub-second latency. This real-time computation requirement is naturally fulfilled by stream processing systems, which enable analysis of data in motion. IBM InfoSphere Streams is an award-winning product in this domain, enabling line-rate processing of real-time data streams for an extremely wide range of analytics, because it both provides out-of-the-box analytic capabilities and allows easy extension with Java and C++ logic. The underlying engine is supported by the Stream Processing Language (SPL), which allows practitioners to define complex analytic applications using very simple constructs. This, combined with a custom Integrated Development Environment (IDE), facilitates a rich ecosystem for stream processing in which the practitioner focuses primarily on the desired analytics while the platform does all the heavy lifting. In this paper, we compare the performance of IBM InfoSphere Streams against Apache Storm [1], a leading open source alternative, to augment existing literature [2].
To this end, we implemented a real-world stream processing application, which performs email classification for online spam detection [3], on both platforms. Our goal was to analyze both the quantitative differences in performance and the qualitative differences in application writing and framework tuning. Similar to other studies [4, 5], we employed CPU time and throughput as the primary metrics to compare the efficacy of both systems. Overall, our results show that for the application benchmark documented in this paper, Streams outperforms Storm by 2.6 to 12.3 times in terms of throughput while simultaneously consuming 5.5 to 14.2 times less CPU time. The measurements we made and the particular system environments used are described in this paper. Notably, the outcomes from this study show that the CPU …
[1] Nathaniel S. Borenstein et al. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. RFC, 1996.
[2] Yiming Yang et al. Introducing the Enron Corpus. CEAS, 2004.
[3] Blaine Nelson et al. Analyzing Behavioral Features for Email Classification. CEAS, 2005.
[4] Philip S. Yu et al. Finding "Who Is Talking to Whom" in VoIP Networks via Progressive Stream Clustering. Sixth International Conference on Data Mining (ICDM'06), 2006.
[5] Vangelis Metsis et al. Spam Filtering with Naive Bayes - Which Naive Bayes? CEAS, 2006.
[7] Alain Biem et al. IBM InfoSphere Streams for scalable, real-time, intelligent transportation services. SIGMOD Conference, 2010.
[8] Alain Biem et al. A streaming approach to radio astronomy imaging. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
[9] Nesime Tatbul et al. Large-Scale DNA Sequence Analysis in the Cloud: A Stream-Based Approach. Euro-Par Workshops, 2011.
[10] David H. Reiley et al. The Economics of Spam. 2012.
[11] Anirban Dasgupta et al. Impact of Spam Exposure on User Engagement. USENIX Security Symposium, 2012.
[12] Toyotaro Suzumura et al. A Performance Analysis of System S, S4, and Esper via Two Level Benchmarking. QEST, 2013.
[13] Daniel Mills et al. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. Proc. VLDB Endow., 2013.
[14] Scott Shenker et al. Discretized streams: fault-tolerant streaming computation at scale. SOSP, 2013.