Soft Coherence: Preliminary Experiments with Error-Tolerant Memory Consistency in Numerical Applications

As we scale into the multi-core era, we face severe challenges in the scalability and performance of on-chip cache-coherent shared memory mechanisms. We explore application error tolerance as an extra degree of freedom to meet these challenges. Iterative numerical algorithms, in particular, can cope with the occasional stale value with little or no effect on accuracy or convergence time. We explore analysis methods to distinguish between critical and non-critical data in such algorithms. We exploit this distinction to design soft coherence protocols that provide strong guarantees for critical data and weak guarantees for non-critical data. Our preliminary results use a conjugate gradient solver as an example, with experiments on five sparse matrices showing 6.9%-12.6% performance improvement, with little loss in precision.
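The claim that iterative solvers tolerate occasional stale values can be illustrated with a small software-only sketch. It is not the paper's protocol or its conjugate gradient experiment: it swaps in plain Jacobi iteration on a strictly diagonally dominant matrix, where bounded staleness is known not to prevent convergence, and models a stale read as seeing the previous iterate. The names `jacobi`, `stale_prob`, and `max_error`, and the test matrix, are assumptions made for illustration.

```python
import random


def jacobi(A, b, iters, stale_prob=0.0, seed=0):
    """Jacobi iteration in which each neighbour read may return a stale value.

    stale_prob is a hypothetical knob: the probability that a read of x[j]
    sees the previous iterate instead of the current one.
    """
    rng = random.Random(seed)
    n = len(b)
    prev = [0.0] * n  # iterate k-1, standing in for a stale cached copy
    cur = [0.0] * n   # iterate k, the up-to-date copy
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            s = 0.0
            for j in range(n):
                if j == i:
                    continue
                # Weak guarantee: this read may observe the older value.
                src = prev if rng.random() < stale_prob else cur
                s += A[i][j] * src[j]
            nxt[i] = (b[i] - s) / A[i][i]
        prev, cur = cur, nxt
    return cur


def max_error(x):
    """Largest deviation from the known exact solution (all ones)."""
    return max(abs(v - 1.0) for v in x)


# Assumed test problem: tridiagonal, strictly diagonally dominant.
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [sum(row) for row in A]  # chosen so the exact solution is all ones

x_fresh = jacobi(A, b, iters=60, stale_prob=0.0)
x_stale = jacobi(A, b, iters=60, stale_prob=0.2)
print(max_error(x_fresh), max_error(x_stale))
```

In this toy setting both runs end within a small tolerance of the exact solution; the stale run only contracts the error more slowly. Whether the same holds for conjugate gradient on real sparse matrices is what the experiments in the paper address, and this sketch makes no claim about that.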
