Optimizing NANOS OpenMP for the IBM Cyclops multithreaded architecture

In this paper, we present two approaches to improve the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. The two solutions are independent of each other, and both aim to improve performance through better management of cache locality. The first is a software modification to the OpenMP runtime library that balances stack accesses across all data caches. The second is a small hardware modification that changes the data cache mapping behavior, with the same goal. Both solutions help parallel applications scale better and perform better on this kind of architecture, and they could also be applied to future multi-core processors. We have run some of the NAS benchmarks in simulation to evaluate these proposals. The results show that, with small changes to both the software and the hardware, parallel applications achieve very good scalability. They also show that standard execution environments designed for multiprocessor architectures can be easily adapted to exploit multithreaded processors.
