Performance analysis on a CC-NUMA prototype

Cache-coherent nonuniform memory access (CC-NUMA) machines have been shown to be a promising paradigm for exploiting distributed execution. Because a CC-NUMA machine presents a single image of memory, it can deliver the performance typically associated with parallel machines without the high cost of parallel programming. Past research on CC-NUMA machines has focused on modifications to the memory hierarchy, interconnect topology, and memory consistency protocols, all of which are critical to achieving scalable performance. The research described here expands this focus to operating system structures that can increase system scalability. We describe a hardware/software prototyping study that investigates how changes to the operating system of a commercial IBM AS/400® system can provide scalable performance when running transaction processing workloads. The project was a joint effort between researchers at the IBM Thomas J. Watson Research Center and a team from the AS/400 development laboratory in Rochester, Minnesota. This paper describes various aspects of the project, including the changes made to the operating system to enable scalable performance and the hardware and software performance tools developed to identify bottlenecks in the existing operating system structures.
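To make the programming-model point concrete, the following is a minimal sketch (not taken from the paper) of what shared-memory code on a CC-NUMA system can look like: ordinary POSIX threads sharing a single array through plain loads and stores, with no explicit message passing. Whether each access turns out to be local or remote, and therefore how well the program scales, is decided by data and thread placement of the kind the operating system changes described here target. All names in the sketch are illustrative.

/*
 * Minimal illustration, assuming a POSIX threads environment: on a
 * CC-NUMA machine the hardware maintains one coherent address space,
 * so threads on different nodes share data through ordinary loads and
 * stores, just as on a bus-based SMP. Remote accesses are simply
 * slower, which is what OS placement policies try to minimize.
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITEMS   (1 << 20)

static double data[NITEMS];           /* one shared array, one address space */
static double partial[NTHREADS];      /* per-thread partial sums */

static void *sum_slice(void *arg)
{
    long id = (long)arg;
    long lo = id * (NITEMS / NTHREADS);
    long hi = lo + (NITEMS / NTHREADS);
    double s = 0.0;

    /* Plain loads; the coherence hardware makes remote pages usable,
       at the cost of extra latency that performance tools can expose. */
    for (long i = lo; i < hi; i++)
        s += data[i];

    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NITEMS; i++)
        data[i] = 1.0;

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, sum_slice, (void *)i);

    double total = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += partial[i];
    }

    printf("sum = %.0f\n", total);
    return 0;
}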
