Improving Per-Node Computing Efficiency by an Adaptive Lock-Free Scheduling Model

Job scheduling on many-core computers with tens or even hundreds of processing cores is one of the key technologies in High Performance Computing (HPC) systems. Although many scheduling algorithms have been proposed, it remains challenging to execute jobs efficiently when they are assigned to a single computing node with diverse scheduling objectives. Moreover, the increasing scale and the need for rapid response to changing requirements are hard to meet with existing scheduling models in an HPC node. To address these issues, we propose a novel adaptive scheduling model that is applied to a single node with a many-core processor; this model addresses scheduling efficiency and scalability through an adaptive optimistic control mechanism. This mechanism exposes information such that all the cores are provided with jobs and with the tools necessary to take advantage of that information, and thus compete for resources in an uncoordinated manner. At the same time, the mechanism is equipped with adaptive control, allowing it to adjust the number of running tools dynamically when conflicts occur frequently. We justify this scheduling model and present simulation results for synthetic and real-world HPC workloads, in which we compare our proposed model with two widely used scheduling models, i.e., multi-path monolithic and two-level scheduling. The proposed approach outperforms the other models in scheduling efficiency and scalability. Our results demonstrate that adaptive optimistic control affords significant improvements for HPC workloads in both the parallelism and the performance of the node-level scheduling model.

Key words: job scheduling, adaptive lock-free scheduling, optimistic concurrency control, high performance computing, many-core
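
The adaptive optimistic control described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it reads the "running tools" as concurrently active schedulers, and the names `plan` and `simulate`, the round-based model in which every placed job holds its cores for exactly one round, the offset-based core choice, and the halve-or-grow-by-one policy with a 0.25 conflict threshold are all hypothetical choices made for illustration.

```python
def plan(snapshot, need, index, active):
    """Pick `need` cores from a private snapshot, starting at a per-scheduler offset."""
    start = index * len(snapshot) // active
    rotated = snapshot[start:] + snapshot[:start]
    return set(rotated[:need])


def simulate(num_cores, jobs, max_schedulers=4, threshold=0.25):
    """Place every job (given as a core count); return placement and conflict statistics."""
    if any(need > num_cores for need in jobs):
        raise ValueError("a job requests more cores than the node has")
    pending = list(jobs)
    active = max_schedulers              # number of schedulers currently allowed to run
    history, placed, conflicts = [], 0, 0
    while pending:
        free = set(range(num_cores))     # assumption: all cores are released between rounds
        snapshot = sorted(free)          # every scheduler plans from the same stale view
        batch = [pending.pop(0) for _ in range(min(active, len(pending)))]
        round_conflicts = 0
        for index, need in enumerate(batch):
            claim = plan(snapshot, need, index, active)
            if claim <= free:            # optimistic commit: are the claimed cores still free?
                free -= claim
                placed += 1
            else:                        # conflict: another scheduler won, so retry later
                round_conflicts += 1
                pending.append(need)
        conflicts += round_conflicts
        history.append(active)
        # Adaptive control: back off when conflicts are frequent, otherwise grow again.
        if round_conflicts / len(batch) > threshold:
            active = max(1, active // 2)
        else:
            active = min(max_schedulers, active + 1)
    return {"placed": placed, "conflicts": conflicts, "history": history}


if __name__ == "__main__":
    print(simulate(8, [3] * 8))
```

No locks are taken in this sketch: schedulers plan independently and a conflict is detected only at commit time, which is the optimistic part. The `history` list records how many schedulers were active in each round, so the back-off after a conflict-heavy round and the later regrowth can be inspected directly.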
