Exploiting Staleness for Approximating Loads on CMPs

Coherence misses are an important factor limiting the scalability of multi-threaded shared-memory applications on chip multiprocessors (CMPs), which are expected to contain dozens of cores in the near future. This paper proposes a novel approach to tackling this problem by leveraging the increasingly important paradigm of approximate computing. Many applications either tolerate slight errors in the output or, where the output must be exact, have built-in resiliency that absorbs some errors during execution. The approximate computing paradigm relaxes the conventional requirement of strict correctness in hardware, allowing more flexibility in the performance-power-reliability design space. Examining the multi-threaded applications in the SPLASH-2 benchmark suite, we observe that nearly all of them have such inherent resiliency and/or tolerance to slight output errors. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version of the data at the time of invalidation. We show that returning such values from invalidated lines still present in the d-L1 offers only limited scope for improvement, since those lines are evicted fairly soon under the high pressure on the d-L1. Instead, we propose a very small (8-line) Stale Victim Cache (SVC) to hold such lines upon d-L1 eviction. While this offers significant improvement, data can grow very stale in such a structure, making it highly sensitive to the choice of which data to keep and for how long. To address these concerns, we time out lines in the SVC to bound their staleness, in a mechanism we call SVC+TB. We show that SVC+TB provides up to 28.6% speedup in some SPLASH-2 applications, with an average speedup of 10-15% across the suite, making performance comparable to an ideal execution that incurs no coherence misses. Further, the resulting approximations have little impact on correctness, and all applications run to completion. Eleven applications showed no errors at all, owing to inherent application resilience, and the maximum output error across the suite was 0.08%.
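
To make the mechanism concrete, below is a minimal, illustrative C++ sketch of the SVC+TB behavior described above. The class and member names (StaleVictimCache, insert, lookup), the timeout knob, and the oldest-first replacement policy are assumptions for illustration only; the paper's actual structure is a small hardware buffer alongside the d-L1 driven by the coherence protocol, which a software model can only approximate.

```cpp
// Illustrative sketch of a Stale Victim Cache (SVC) with time-out (SVC+TB).
// All names and policies here are assumptions, not the paper's implementation.
#include <array>
#include <cstdint>
#include <optional>
#include <vector>

struct StaleLine {
    bool     valid = false;
    uint64_t tag = 0;                  // block-aligned line address
    uint64_t insertCycle = 0;          // cycle the line entered the SVC
    std::vector<uint8_t> data;         // stale copy captured at invalidation
};

class StaleVictimCache {
  public:
    explicit StaleVictimCache(uint64_t timeoutCycles)  // staleness bound (assumed knob)
        : timeout_(timeoutCycles) {}

    // Called when an invalidated line is evicted from the d-L1:
    // capture its (stale) contents instead of discarding them.
    void insert(uint64_t tag, std::vector<uint8_t> data, uint64_t now) {
        StaleLine& victim = selectVictim(now);
        victim.valid = true;
        victim.tag = tag;
        victim.insertCycle = now;
        victim.data = std::move(data);
    }

    // Called on an approximable coherence-related load miss: return the stale
    // value if present and not timed out, so the core need not wait for the
    // remote/directory response.
    std::optional<std::vector<uint8_t>> lookup(uint64_t tag, uint64_t now) {
        for (StaleLine& line : lines_) {
            if (line.valid && line.tag == tag) {
                if (now - line.insertCycle > timeout_) {
                    line.valid = false;      // too stale: drop it (the TB part)
                    return std::nullopt;
                }
                return line.data;
            }
        }
        return std::nullopt;
    }

  private:
    // Replacement: prefer an invalid or timed-out entry, else the oldest.
    StaleLine& selectVictim(uint64_t now) {
        StaleLine* oldest = &lines_[0];
        for (StaleLine& line : lines_) {
            if (!line.valid || now - line.insertCycle > timeout_) return line;
            if (line.insertCycle < oldest->insertCycle) oldest = &line;
        }
        return *oldest;
    }

    std::array<StaleLine, 8> lines_;   // 8-line, fully associative structure
    uint64_t timeout_;
};
```

The property the sketch captures is that a stale line is served only within a bounded window after invalidation: the timeout limits how far the returned value can drift from the coherent copy, while the small victim structure still hides the latency of the coherence miss.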
