Paper Information - A Lightweight Model for Right-Sizing Master-Worker Applications

When running a parallel application at scale, a resource provisioning policy should minimize both over-commitment (idle resources) and under-commitment (resource contention). However, users seldom know how many resources their application needs to run well. Even with that knowledge, over- and under-commitment may still occur because the application does not run in isolation: it shares resources such as the network and filesystems. We formally define the capacity of a parallel application as the quantity of resources that can be effectively provisioned to achieve the best execution time in a given environment. We present a model that estimates the capacity of master-worker applications as they run, based on execution and data-transfer times. We demonstrate the model with two bioinformatics workflows, a machine learning application, and one synthetic application. Our results show that the model correctly tracks the known capacity as the application scales, as task behavior changes dynamically, and as task throughput improves.
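
The abstract does not state the model's formula, so the following is only a minimal sketch of the general idea of estimating capacity from execution and data-transfer times. It assumes (this is not taken from the paper) that the master moves data for one task at a time, so a task that computes for E seconds and costs T seconds of transfer lets the master keep roughly E / T workers busy. The names `TaskRecord` and `estimate_capacity` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskRecord:
    """Timing for one completed task (hypothetical record layout)."""
    exec_seconds: float      # time a worker spent computing the task
    transfer_seconds: float  # time the master spent moving the task's data


def estimate_capacity(tasks: Iterable[TaskRecord]) -> float:
    """Rough capacity estimate: total execution time / total transfer time.

    Assumption (not from the paper): the master transfers data serially,
    so while one worker computes, the master can feed about E / T others.
    """
    records = list(tasks)
    if not records:
        raise ValueError("need at least one completed task")
    total_exec = sum(r.exec_seconds for r in records)
    total_transfer = sum(r.transfer_seconds for r in records)
    if total_transfer <= 0:
        # No measurable transfer cost: the master never limits scaling.
        return float("inf")
    return total_exec / total_transfer


if __name__ == "__main__":
    history = [TaskRecord(60.0, 2.0), TaskRecord(40.0, 2.0)]
    print(estimate_capacity(history))
```

Under these assumptions, two tasks that compute for 60 s and 40 s and each cost 2 s of transfer give an estimate of 25 workers. Recomputing the estimate as new tasks complete is one way such a figure could follow changes in task behavior during a run.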

Douglas Thain | Nathaniel Kremer-Herman | Benjamín Tovar
