Tornado: maximizing locality and concurrency in a shared memory multiprocessor operating system

We describe the design and implementation of Tornado, a new operating system designed from the ground up specifically for today's shared memory multiprocessors. The need for improved locality in the operating system is growing as multiprocessor hardware evolves, increasing the costs of cache misses and sharing, and adding complications due to NUMAness. Tornado is optimized so that locality and independence in application requests for operating system services (whether from multiple sequential applications or a single parallel application) are mapped onto locality and independence in the servicing of these requests in the kernel and system servers. By contrast, previous shared memory multiprocessor operating systems all evolved from designs constructed at a time when sharing costs were low, memory latency was low and uniform, and caches were small; for these systems, concurrency was the main performance concern and locality was not an important issue. Tornado achieves this locality by starting with an object-oriented structure, where every virtual and physical resource is represented by an independent object. Locality, as well as concurrency, is further enhanced by three key innovations: (i) clustered objects that support the partitioning of contended objects across processors, (ii) a protected procedure call facility that preserves the locality and concurrency of IPCs, and (iii) a new locking strategy that allows all locking to be encapsulated within the objects being protected and greatly simplifies the overall locking protocols. As a result of these techniques, Tornado has far better performance characteristics, particularly for multithreaded applications, than existing commercial operating systems. Tornado has been fully implemented and runs both on Toronto's NUMAchine hardware and on the SimOS simulator.
