SpeQuloS: a QoS service for BoT applications using best effort distributed computing infrastructures

Exploiting Best Effort Distributed Computing Infrastructures (BE-DCIs) allows operators to maximize the utilization of their infrastructures, and users to access unused resources at relatively low cost. Because providers do not guarantee that computing resources remain available to users during the entire execution of their applications, BE-DCIs offer a lower Quality of Service (QoS) than traditional infrastructures. Profiling the execution of Bag-of-Tasks (BoT) applications on several kinds of BE-DCIs demonstrates that their task completion rate drops near the end of the execution. In this paper, we present the SpeQuloS framework, which enhances the QoS of BoT applications executed on BE-DCIs by reducing the execution time, improving its stability, and reporting a predicted completion time to users. SpeQuloS monitors the execution of the BoT on the BE-DCIs and dynamically supplies fast and reliable Cloud resources when the critical part of the BoT is executed. We present the design and development of the service and several strategies to decide when and how Cloud resources should be provisioned. Performance evaluation using simulations shows that SpeQuloS fulfills its objectives. It speeds up the execution of BoTs, in the best cases by a factor greater than 2, while offloading less than 2.5% of the workload to the Cloud. We report on preliminary results after a complex deployment as part of the European Desktop Grid Infrastructure.
