Dynamic Resource-Efficient Scheduling in Data Stream Management Systems Deployed on Computing Clouds

Scheduling streaming applications in Data Stream Management Systems (DSMS) has been investigated for years. As the deployment platform of DSMS migrates from on-premise clusters to elastic computing clouds, new requirements have emerged for the scheduling process to handle workload fluctuations on heterogeneous cloud resources. Resource-efficient scheduling aims to improve cost-efficiency at runtime by dynamically matching the resource demands of streaming applications with the resource availability of computing nodes. In this paper, we model the scheduling problem as a bin-packing variant and propose a heuristic-based algorithm that solves it while minimising inter-node communication. We also present a prototype scheduler named D-Storm, which extends the original Apache Storm framework into a self-adaptive MAPE-K (Monitoring, Analysis, Planning, Execution, Knowledge) architecture and which we use to validate the efficacy and efficiency of our scheduling algorithm. An evaluation on real-world applications such as Twitter Sentiment Analysis shows that D-Storm outperforms the existing resource-aware scheduler and the default Storm scheduler in reducing inter-node traffic and application latency, and that it yields significant resource savings through task consolidation.
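
To make the bin-packing view concrete, the following is a minimal sketch of a greedy, first-fit-decreasing style placement that prefers nodes already hosting a task's communication partners. It is an illustration under simplifying assumptions, not the D-Storm algorithm itself: it uses a single resource dimension, and the names `Node`, `greedy_place`, `inter_node_traffic`, and the input dictionaries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A computing node with one resource dimension (hypothetical unit)."""
    name: str
    capacity: float
    tasks: list = field(default_factory=list)
    used: float = 0.0

    def fits(self, demand: float) -> bool:
        return self.used + demand <= self.capacity


def greedy_place(demands, traffic, capacities):
    """Assign tasks to nodes, largest demand first.

    demands:    {task: resource demand}
    traffic:    {(task_a, task_b): data rate between the two tasks}
    capacities: list of node capacities
    Returns {task: node name}; raises ValueError if a task fits nowhere.
    """
    nodes = [Node(f"node{i}", cap) for i, cap in enumerate(capacities)]
    placement = {}

    def rate(a, b):
        # Traffic is treated as undirected for the co-location score.
        return traffic.get((a, b), 0.0) + traffic.get((b, a), 0.0)

    # Sort by decreasing demand; break ties by name for determinism.
    for task in sorted(demands, key=lambda t: (-demands[t], t)):
        best, best_key = None, None
        for idx, node in enumerate(nodes):
            if not node.fits(demands[task]):
                continue
            local = sum(rate(task, other) for other in node.tasks)
            # Prefer more co-located traffic, then nodes already in use
            # (consolidation), then the lowest node index.
            key = (local, 1 if node.tasks else 0, -idx)
            if best_key is None or key > best_key:
                best, best_key = node, key
        if best is None:
            raise ValueError(f"no node can host task {task!r}")
        best.tasks.append(task)
        best.used += demands[task]
        placement[task] = best.name
    return placement


def inter_node_traffic(placement, traffic):
    """Sum the data rate of all task pairs placed on different nodes."""
    return sum(r for (a, b), r in traffic.items()
               if placement[a] != placement[b])
```

A call such as `greedy_place({"spout": 30, "parse": 40, "score": 50, "sink": 20}, {("spout", "parse"): 100, ("parse", "score"): 80, ("score", "sink"): 10}, [100, 100, 100])` returns a task-to-node mapping, and `inter_node_traffic` can then score it. A multi-dimensional variant would replace the scalar `fits` check with one comparison per resource type (for example CPU and memory).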
