Using overdecomposition to overlap communication latencies with computation and take advantage of SMT processors

Parallel programs running on clusters are typically decomposed and mapped to run with one thread per processor, each working on its own disjoint subset of the data. Using overdecomposition to map multiple threads to each processor so that computation overlaps with communication, we evaluate the resulting performance improvements and limitations for a micro-benchmark and the NAS benchmarks. The experiment platform is a cluster of Pentium 4 simultaneous multithreading (SMT) processor nodes interconnected through gigabit Ethernet. Micro-benchmark results demonstrate execution time improvements of up to a factor of 1.8. For the NAS benchmarks, however, overdecomposition and SMT provide only slight performance gains, and sometimes cause significant performance loss. We evaluated how sensitive the improvements and limitations are to problem size, communication structure, and whether SMT is enabled. We found that performance improvements are limited by communication dependencies in the applications that limit thread-level parallelism, by an increase in cache misses, or by increased system activity. Our study contributes a better understanding of these limitations.
