Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support

In our hybrid runtime (HRT) model, a parallel runtime system and the application are together transformed into a specialized OS kernel that operates entirely in kernel mode and can thus implement exactly its desired abstractions on top of fully privileged hardware access. We describe the design and implementation of two new tools that support the HRT model. The first, the Nautilus Aerokernel, is a kernel framework specifically designed to enable HRTs for x64 and Xeon Phi hardware. Aerokernel primitives are specialized for HRT creation and can thus operate up to two orders of magnitude faster than related primitives in Linux. They also exhibit much lower variance in their performance, an important consideration for some forms of parallelism. We have realized several prototype HRTs, including one based on the Legion runtime, and we provide application macrobenchmark numbers for our Legion HRT. The second tool, the hybrid virtual machine (HVM), is an extension to the Palacios virtual machine monitor that allows a single virtual machine to simultaneously support a traditional OS and software stack alongside an HRT with specialized hardware access. The HRT can be booted in a time comparable to a Linux user process startup. Functions in the HRT, which operate over the user process's memory, can be invoked by the process with latencies not much higher than those of a function call.
