A Market-Oriented Heuristic Algorithm for Scheduling Parallel Applications in Big Data Service Platform

A Big Data analytics service platform delivers a new type of public cloud offering, through which end users can outsource their job executions to a group of professional Big Data processing services on a pay-per-use basis. Unlike other types of cloud services, data processing services are dominated by parallel jobs, whose execution time can vary greatly with the runtime configuration, such as the degree of parallelism. In such a market-oriented environment, scheduling end users' jobs efficiently so as to optimize the platform's revenue is a challenging task. In this paper, we propose a market-oriented heuristic algorithm with admission control for scheduling parallel jobs in a Big Data analytics service platform, with the goal of optimizing the platform operator's revenue. The proposed heuristic takes into account not only the dynamic revenue gained from completing a job within a specific runtime and the resources consumed to achieve that runtime, but also the potential loss incurred by running this job instead of the other jobs currently waiting in the system. We also propose a collaborative filtering based approach to quickly and accurately predict the execution time of parallel jobs running in a Big Data analytics service platform. We have conducted extensive experiments and simulations based on workload data derived from a real-world data analytics service platform and from parallel applications. The results show that our scheduler outperforms the scheduling algorithms used for comparison, which are based on classical heuristics from the literature, demonstrating the effectiveness of our market-oriented heuristic scheduling algorithm.
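
To make the three ingredients of the heuristic concrete (runtime-dependent revenue, resource cost, and the loss imposed on waiting jobs), the following is a minimal Python sketch. It is not the algorithm proposed in the paper: the linear revenue decay, the per-core-time cost, the rule that scales the loss by the fraction of the cluster a job occupies, and all names (`Job`, `Config`, `pick_next`, and so on) are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical runtime configuration of a job."""
    cores: int       # degree of parallelism
    runtime: float   # predicted execution time at this parallelism


@dataclass(frozen=True)
class Job:
    """Hypothetical job with a price that decays linearly over time."""
    name: str
    price: float     # revenue if the job finished instantly
    decay: float     # revenue lost per unit of time (assumed linear)
    configs: tuple   # candidate Config objects


def revenue(job, runtime):
    # Assumed revenue model: linear decay, never below zero.
    return max(0.0, job.price - job.decay * runtime)


def net_gain(job, cfg, cost_per_core_time):
    # Revenue minus the cost of the resources consumed by this configuration.
    return revenue(job, cfg.runtime) - cost_per_core_time * cfg.cores * cfg.runtime


def opportunity_loss(job, cfg, waiting, total_cores):
    # Assumed loss model: every other waiting job is delayed by cfg.runtime,
    # weighted by the share of the cluster this configuration occupies.
    share = cfg.cores / total_cores
    delayed = sum(min(o.decay * cfg.runtime, o.price)
                  for o in waiting if o is not job)
    return share * delayed


def pick_next(waiting, free_cores, total_cores, cost_per_core_time):
    """Return the best (job, config, score), or None if nothing is worth running."""
    best = None
    for job in waiting:
        for cfg in job.configs:
            if cfg.cores > free_cores:
                continue  # configuration does not fit right now
            score = (net_gain(job, cfg, cost_per_core_time)
                     - opportunity_loss(job, cfg, waiting, total_cores))
            if best is None or score > best[2]:
                best = (job, cfg, score)
    # Simple admission rule (assumption): run only if the score is positive.
    if best is not None and best[2] > 0:
        return best
    return None
```

In this sketch a job may be started with fewer cores than its fastest configuration would use when the extra parallelism costs more, in resources and in delay to other jobs, than the revenue it recovers. The runtimes in each `Config` stand in for predicted execution times; the sketch takes them as given and does not model how they are predicted.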
