Optimizing message-passing on multicore architectures using hardware multi-threading

Shared memory and message passing are two opposing models for developing parallel programs. The shared-memory model, adopted by frameworks such as OpenMP, is the de-facto standard on multi-/many-core architectures. Message passing nevertheless deserves study for its inherent portability and flexibility, as well as for its easier debugging. Achieving good performance from messages on shared-memory architectures requires an efficient implementation of the run-time support. This paper investigates a delegation mechanism for multi-threaded architectures able to: (i) overlap communication with computation phases, and (ii) parallelize distribution and collective operations. We exemplify our ideas with two parallel benchmarks on the Intel Xeon Phi, showing that in these applications our message-passing support outperforms MPI and reaches performance comparable to standard OpenMP implementations.
