A Lightweight Model for Right-Sizing Master-Worker Applications

When running a parallel application at scale, a resource provisioning policy should minimize both over-commitment (idle resources) and under-commitment (resource contention). However, users seldom know how many resources their application needs to execute well. Even with such knowledge, over- and under-commitment may still occur because the application does not run in isolation: it shares resources such as the network and filesystems. We formally define the capacity of a parallel application as the quantity of resources that may effectively be provisioned for the best execution time in a given environment. We present a model that estimates the capacity of master-worker applications as they run, based on execution and data-transfer times. We demonstrate this model with two bioinformatics workflows, a machine learning application, and one synthetic application. Our results show that the model correctly tracks the known value of capacity as the application scales, as task behavior changes dynamically, and as task throughput improves.
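
To make the idea of a capacity estimate concrete, the following is a minimal sketch and not the paper's published formula. It assumes a single master that serves data transfers one at a time, so that capacity can be approximated as the ratio of total execution time to total transfer time over recently completed tasks. The names `TaskSample` and `estimate_capacity` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskSample:
    """Timing observed for one completed task (hypothetical record type)."""
    transfer_seconds: float  # time the master spent moving the task's data
    execute_seconds: float   # time a worker spent running the task


def estimate_capacity(samples: Iterable[TaskSample]) -> float:
    """Estimate how many workers a single master can keep busy.

    Assumption (not taken from the paper): the master handles transfers
    serially, so while one worker executes for E seconds the master can
    feed roughly E / T other workers, where T is the transfer time.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one completed task")
    total_transfer = sum(s.transfer_seconds for s in samples)
    total_execute = sum(s.execute_seconds for s in samples)
    if total_transfer <= 0:
        # No measurable transfer cost: the master is not the bottleneck.
        return float("inf")
    return total_execute / total_transfer


if __name__ == "__main__":
    observed = [TaskSample(2.0, 100.0), TaskSample(3.0, 150.0)]
    print(estimate_capacity(observed))
```

Because the estimate is recomputed from whatever tasks have completed so far, it can follow changes in task behavior during a run. Under this sketch, provisioning more workers than the estimate would leave some idle, and provisioning fewer would leave the master underused.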
