The Case for Semi-Permanent Cache Occupancy: Understanding the Impact of Data Locality on Network Processing

The performance-critical path of an MPI implementation relies on fast receive-side operation, which in turn requires fast list traversal. The performance of list traversal depends on data locality: whether the data currently resides in a close-to-core cache because of its temporal locality, or whether its spatial locality allows for predictable prefetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spatial locality: by combining multiple entries into a single linked-list element, we can control and modify this form of locality. Second, we explore temporal locality with a new technique called "hot caching", which creates a thread that periodically accesses certain data, increasing its temporal locality. We show that by increasing data locality, we can improve MPI performance on a variety of architectures by up to 4x for micro-benchmarks and up to 2x for an application.
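To make the two techniques concrete, the following is a minimal C sketch, not the paper's implementation: an "unrolled" posted-receive list that packs several match entries into each list element (spatial locality), plus a hypothetical hot-cache helper thread that periodically walks the list to keep it warm (temporal locality). Names such as match_elem_t, ENTRIES_PER_ELEM, and hot_cache_loop are illustrative assumptions, as are the chunking factor and refresh interval.

```c
/* Sketch only: assumed structures and parameters, not the authors' code. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define ENTRIES_PER_ELEM 8  /* assumed chunking factor; tunes spatial locality */

typedef struct match_entry {
    int32_t  tag;       /* MPI tag to match (wildcards handled elsewhere) */
    int32_t  src;       /* source rank */
    void    *recv_buf;  /* user buffer for the posted receive */
    bool     valid;     /* slot in use? */
} match_entry_t;

typedef struct match_elem {
    match_entry_t      entries[ENTRIES_PER_ELEM]; /* several entries per node */
    struct match_elem *next;
} match_elem_t;

/* Linear traversal: consecutive entries share cache lines, so the hardware
 * prefetcher can stream through each element before following `next`. */
static match_entry_t *match_posted(match_elem_t *head, int32_t tag, int32_t src)
{
    for (match_elem_t *e = head; e != NULL; e = e->next)
        for (int i = 0; i < ENTRIES_PER_ELEM; i++)
            if (e->entries[i].valid &&
                e->entries[i].tag == tag &&
                e->entries[i].src == src)
                return &e->entries[i];
    return NULL;
}

/* Hot-caching sketch: a helper thread periodically touches the list so it
 * stays resident in a close-to-core cache between matching attempts. */
struct hot_cache_arg { match_elem_t *head; volatile int stop; };

static void *hot_cache_loop(void *argp)
{
    struct hot_cache_arg *arg = argp;
    volatile int32_t sink = 0;
    while (!arg->stop) {
        for (match_elem_t *e = arg->head; e != NULL; e = e->next)
            for (int i = 0; i < ENTRIES_PER_ELEM; i++)
                sink += e->entries[i].tag;   /* read-only touch */
        usleep(100);                         /* assumed refresh interval */
    }
    (void)sink;
    return NULL;
}
```

The chunked layout trades a small amount of wasted space in partially filled elements for fewer pointer dereferences and better prefetching; the hot-cache thread trades one core's cycles for lower matching latency on the critical receive path.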
