Adaptive Scheduling of Parallel Jobs in Spark Streaming

Streaming data analytics has become increasingly vital in many applications, such as dynamic content delivery (e.g., advertisements), Twitter sentiment analysis, and security event processing (e.g., intrusion detection systems and spam filters). Emerging stream processing systems, such as Spark Streaming, treat the continuous stream as a series of micro-batches of data and continuously process these micro-batch jobs. Compared with traditional stream processing systems, which process streaming data one record at a time, micro-batch based stream processing provides several advantages, including fast recovery from failures, better load balancing, and scalability. However, efficiently scheduling micro-batch jobs to achieve high throughput and low latency is very challenging because of the complex data dependencies and dynamism inherent in streaming workloads. In this paper, we propose A-scheduler, an adaptive scheduling approach that dynamically schedules parallel micro-batch jobs in Spark Streaming and automatically adjusts scheduling parameters to improve performance and resource efficiency. Specifically, A-scheduler schedules multiple jobs concurrently, using different policies based on their data dependencies, and automatically adjusts the level of job parallelism and the resource shares among jobs based on workload properties. We implemented A-scheduler and evaluated it with a real-time security event processing workload. Our experimental results show that, compared to the default Spark Streaming scheduler, A-scheduler reduces end-to-end latency by 42% and improves workload throughput and energy efficiency by 21% and 13%, respectively.
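To make the idea of dependency-aware concurrent scheduling concrete, the following is a minimal sketch in plain Python. It is not A-scheduler's implementation and does not use any Spark API. The function name `plan_waves`, the job names, and the fixed `parallelism` cap are all hypothetical, chosen only for illustration. The sketch groups micro-batch jobs into "waves": jobs in the same wave have no unmet data dependencies and could run concurrently, up to the given parallelism level.

```python
def plan_waves(deps, parallelism):
    """Group jobs into waves of at most `parallelism` jobs each.

    `deps` maps each job name to the names of the jobs it depends on
    (hypothetical input format). A job is placed in a wave only after
    all of its dependencies were placed in earlier waves.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")

    remaining = {job: set(parents) for job, parents in deps.items()}
    done = set()
    waves = []
    while remaining:
        # Jobs whose dependencies have all completed are ready to run.
        ready = sorted(j for j, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        # Cap the number of concurrently running jobs at `parallelism`.
        wave = ready[:parallelism]
        waves.append(wave)
        done.update(wave)
        for job in wave:
            del remaining[job]
    return waves


if __name__ == "__main__":
    # Hypothetical micro-batch jobs: j3 reads j1's output, j4 reads j2's and j3's.
    jobs = {"j1": [], "j2": [], "j3": ["j1"], "j4": ["j2", "j3"]}
    print(plan_waves(jobs, parallelism=2))
```

In this sketch the parallelism level is a fixed argument. An adaptive scheduler in the spirit of the abstract would instead adjust it at runtime based on observed workload properties, which is outside the scope of this example.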
