NO2: Speeding up Parallel Processing of Massive Compute-Intensive Tasks

Large-scale computing frameworks, whether hosted in the cloud or deployed on high-end local clusters, have become an indispensable software infrastructure for numerous enterprise and scientific applications. Tasks executed on these frameworks are generally classified as either data-intensive or compute-intensive. However, most existing frameworks, led by MapReduce, are mainly suited to data-intensive tasks. Their task schedulers assume that the proportion of data I/O completed reflects a task's progress and state. Unfortunately, this assumption does not hold for most compute-intensive tasks. Because their estimates of task progress are biased, traditional frameworks cannot cut off outliers in time, and execution time is substantially prolonged when they run compute-intensive tasks. We propose a new framework designed for compute-intensive tasks. Using instrumentation and an automatic instrumentation point selector, our framework estimates the progress of a compute-intensive task without relying on data I/O. We employ a clustering method to identify outliers at runtime and then perform speculative execution or aborting, speeding up task execution by up to 25%. Moreover, our improvements over bare instrumentation keep the overhead within 0.1%, and aborting-based execution increases average CPU usage by only 10%. This low overhead and resource consumption make our framework practical for use in production environments.
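
As a rough illustration of the idea described above, the following Python sketch estimates per-task progress from instrumentation counters and flags slow tasks by clustering their progress rates. It is a minimal sketch under stated assumptions, not the framework's implementation: the abstract does not name the clustering method, so a simple one-dimensional two-means split is used here as a stand-in, and the names `progress_rate`, `two_means_1d`, `find_outliers`, and the `min_gap` threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def progress_rate(points_hit: int, points_total: int, elapsed_s: float) -> float:
    """Fraction of instrumentation points passed per second (hypothetical metric)."""
    if points_total <= 0 or elapsed_s <= 0:
        raise ValueError("points_total and elapsed_s must be positive")
    return (points_hit / points_total) / elapsed_s


def two_means_1d(values: List[float], iters: int = 20) -> Tuple[float, float]:
    """Split 1-D values into a low and a high cluster; return both centroids."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low_group = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_group = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high_group:  # all values identical: nothing to split
            break
        new_lo = sum(low_group) / len(low_group)
        new_hi = sum(high_group) / len(high_group)
        if (new_lo, new_hi) == (lo, hi):  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def find_outliers(rates: Dict[str, float], min_gap: float = 0.5) -> List[str]:
    """Return ids of tasks in the slow cluster.

    The slow cluster is reported only when its centroid is below
    min_gap * (fast centroid); min_gap is an assumed tuning knob.
    """
    if len(rates) < 2:
        return []
    lo, hi = two_means_1d(list(rates.values()))
    if hi <= 0 or lo >= min_gap * hi:
        return []  # clusters are too close to call anything an outlier
    return sorted(t for t, r in rates.items() if abs(r - lo) <= abs(r - hi))


if __name__ == "__main__":
    # Made-up progress rates for four tasks of one job.
    rates = {"t1": 0.010, "t2": 0.011, "t3": 0.0095, "t4": 0.002}
    print(find_outliers(rates))
```

A scheduler could then treat the returned task ids as candidates for speculative re-execution or aborting. The point of the sketch is that the decision depends only on instrumentation counts and elapsed time, not on the amount of data read or written.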
