Evaluation of OpenMP for the Cyclops Multithreaded Architecture

Multithreaded architectures have the potential of tolerating large memory and functional unit latencies and increase resource utilization. The Blue Gene/Cyclops architecture, being developed at the IBM T. J. Watson Research Center, is one such systems that offers massive intra-chip parallelism. Although the BG/C architecture was initially designed to execute specific applications, we believe that it can be effectively used on a broad range of parallel numerical applications. Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization. In this paper, we describe the implementation of an OpenMP environment for parallelizing applications, currently under development at the CEPBA-IBM Research Institute, targeting BG/C. The environment is evaluated with a set of simple numerical kernels and a subset of the NAS OpenMP benchmarks. We identify issues that w ere not initially considered in the design of the BG/C architecture to support a programming model such as OpenMP. We also evaluate features currently offered by the BG/C architecture that should be considered in the implementation of an efficient OpenMP layer for massive intra-chip parallel architectures.

[1]  Ajay K. Royyuru,et al.  Blue Gene: A vision for protein science using a petaflop supercomputer , 2001, IBM Syst. J..

[2]  Eduard Ayguadé,et al.  NanosCompiler: supporting flexible multilevel parallelism exploitation in OpenMP , 2000 .

[3]  P.M. Kogge,et al.  Pursuing a petaflop: point designs for 100 TF computers using PIM technologies , 1996, Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96).

[4]  Josep Torrellas,et al.  Toward a cost-effective DSM organization that exploits processor-memory integration , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[5]  Mitsuhisa Sato,et al.  Design of OpenMP Compiler for an SMP Cluster , 1999 .

[6]  Allan Snavely,et al.  DATA INTENSIVE VOLUME VISUALIZATION ON THE TERA MTA AND CRAY T � , 1999 .

[7]  Dean M. Tullsen,et al.  Supporting fine-grained synchronization on a simultaneous multithreading processor , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[8]  Seung-Moon Yoo,et al.  FlexRAM: toward an advanced intelligent memory system , 1999, Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040).

[9]  Katherine Yelick,et al.  A Case for Intelligent RAM: IRAM , 1997 .

[10]  Eduard Ayguadé,et al.  Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors , 1999, ICS '99.

[11]  Thomas Sterling,et al.  The Gilgamesh processor-in-memory architecture and its execution model , 2001 .

[12]  Willy Zwaenepoel,et al.  OpenMP on Networks of Workstations , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[13]  Balaram Sinharoy,et al.  POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[14]  José E. Moreira,et al.  Dissecting Cyclops: a detailed analysis of a multithreaded architecture , 2003, CARN.

[15]  Larry Carter,et al.  Multi-processor Performance on the Tera MTA , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[16]  Eduard Ayguadé,et al.  A Library Implementation of the Nano-Threads Programming Model , 1996, Euro-Par, Vol. II.

[17]  Milind Girkar,et al.  Parafrase-2: an Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors , 1989, Int. J. High Speed Comput..

[18]  William J. Dally,et al.  A bandwidth-efficient architecture for media processing , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[19]  Dean M. Tullsen,et al.  Simultaneous multithreading: a platform for next-generation processors , 1997, IEEE Micro.

[20]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[21]  José E. Moreira,et al.  Evaluation of a multithreaded architecture for cellular computing , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[22]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[23]  Dean M. Tullsen,et al.  Tuning Compiler Optimizations for Simultaneous Multithreading , 2004, International Journal of Parallel Programming.

[24]  Vivek Sarkar,et al.  Baring It All to Software: Raw Machines , 1997, Computer.

[25]  Frederic T. Chong,et al.  Active pages: a computation model for intelligent memory , 1998, ISCA.

[26]  William H. Press,et al.  Numerical recipes in C , 2002 .

[27]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[28]  Jaewook Shin,et al.  Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[29]  Constantine D. Polychronopoulos,et al.  α-coral: a multigrain, multithreaded processor architecture , 2001, ICS '01.