An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Page-based software Distributed Shared Memory (sDSM) systems suffer from high memory consistency costs. An effective prefetch technique can reduce this overhead. However, accurate prediction is hard for applications that exhibit dynamic memory access and paging behavior. In this paper, we use Intel Cluster OpenMP (CLOMP) to study this problem. First, we present a stride-augmented run-length encoding (sRLE) method that reconstructs a series of numbers into 2D rectangles, which facilitates more accurate analysis of paging behavior. Historical page miss records of OpenMP parallel and sequential regions are reconstructed and compressed by sRLE. Second, we design and implement a dynamic page prefetch technique (DReP) that uses these reconstructed records to predict and issue prefetches. DReP and its implementation are evaluated through simulations and experiments. The simulation results show that DReP significantly improves the efficiency (~34%) and coverage (~47%) of existing prefetch techniques. Moreover, the experimental results show that DReP reduces the memory consistency costs of CLOMP by 86% in an extreme false sharing scenario. With the assistance of sRLE, DReP reduces memory consistency costs by ~45% and ~38% for the LINPACK and NPB-OMP benchmarks on GigE and DDR IB networks, respectively. A detailed breakdown analysis shows that the software overhead introduced by DReP is negligible (~2%).
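The abstract does not spell out how sRLE works, so the following is only a minimal sketch of the general idea of run-length encoding augmented with a stride, applied to a list of missed page numbers. The function names `stride_rle` and `stride_rle_decode`, the `(start, stride, length)` triple, and the rule that a run needs at least three elements are all assumptions made for illustration. The sketch does not attempt the reconstruction into 2D rectangles described above.

```python
from typing import List, Tuple

# Hypothetical run format for this sketch: (start page, stride, run length).
Run = Tuple[int, int, int]


def stride_rle(pages: List[int]) -> List[Run]:
    """Greedily compress page numbers into constant-stride runs.

    Assumption of this sketch: a run is only emitted when at least three
    consecutive entries share the same stride; anything shorter is stored
    as a singleton (start, 0, 1).
    """
    runs: List[Run] = []
    i, n = 0, len(pages)
    while i < n:
        j = i
        stride = 0
        if i + 1 < n:
            # Candidate stride is the gap between the first two entries.
            stride = pages[i + 1] - pages[i]
            j = i + 1
            # Extend the run while the gap stays the same.
            while j + 1 < n and pages[j + 1] - pages[j] == stride:
                j += 1
        length = j - i + 1
        if length >= 3:
            runs.append((pages[i], stride, length))
            i = j + 1
        else:
            runs.append((pages[i], 0, 1))
            i += 1
    return runs


def stride_rle_decode(runs: List[Run]) -> List[int]:
    """Expand the runs back into the original page sequence."""
    pages: List[int] = []
    for start, stride, length in runs:
        pages.extend(start + k * stride for k in range(length))
    return pages
```

For example, the miss record `[100, 101, 102, 103, 200, 204, 208, 50]` compresses to three runs: a unit-stride run of four pages, a stride-4 run of three pages, and one singleton. Decoding the runs recovers the original record, so the compression is lossless.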
