SPMgr: Dynamic workflow manager for sampling and filtering data streams over Apache Storm

This article addresses dynamic workflow management for sampling and filtering data streams in Apache Storm. Because many sensors generate data streams continuously, sampling is often used to select representative data and filtering to remove unnecessary data. Apache Storm is a real-time distributed processing platform suited to handling large data streams. However, Storm must stop all processing whenever the input data structure or the processing algorithm changes, because the programs have to be modified, redistributed, and restarted. In addition, Storm is often used together with Kafka and databases for effective data processing, but these platforms are difficult to use in an integrated manner. We identify the problems that arise when sampling and filtering algorithms are applied to Storm and propose a dynamic workflow management model that solves them. First, we present the concept of a plan, which consists of the input, processing, and output modules of a data stream. Second, we propose Storm Plan Manager (SPMgr), which operates Storm, Kafka, and a database as a single integrated system. Storm Plan Manager is an integrated workflow manager that dynamically controls the sampling and filtering of data streams through plans. Third, as a key feature, Storm Plan Manager provides a Web client interface for visually creating, executing, and monitoring plans. We demonstrate the usefulness of Storm Plan Manager by presenting its design, implementation, and experimental results in turn.
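
To make the plan concept concrete, the following is a minimal sketch of a plan as a chain of input, processing, and output modules. It is not the Storm Plan Manager implementation and does not use Storm or Kafka: the `Plan` class, its field names, and the in-memory source and sink are hypothetical, and reservoir sampling is used here only as one standard example of a stream-sampling step.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, List


def reservoir_sample(stream: Iterable[float], k: int, rng: random.Random) -> List[float]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir: List[float] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


@dataclass
class Plan:
    """Hypothetical plan: one input module, ordered processing modules, one output module."""
    source: Callable[[], Iterable[float]]
    steps: List[Callable[[Iterable[float]], Iterable[float]]]
    sink: Callable[[Iterable[float]], None]

    def run(self) -> None:
        data = self.source()
        for step in self.steps:
            data = step(data)
        self.sink(data)


# Assumed stand-ins: a generator plays the input stream, a list plays the database.
collected: List[float] = []
rng = random.Random(0)
plan = Plan(
    source=lambda: (float(x) for x in range(100)),
    steps=[
        lambda xs: (x for x in xs if x >= 10.0),  # filtering module
        lambda xs: reservoir_sample(xs, 5, rng),  # sampling module
    ],
    sink=collected.extend,
)
plan.run()
```

In this sketch, changing the workflow means building a new `Plan` value with different steps rather than editing and redeploying the processing code, which is the kind of flexibility the plan abstraction is meant to provide.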
