Exploring memory consistency for massively-threaded throughput-oriented processors

We revisit hardware consistency models in the new context of massively-threaded throughput-oriented processors (MTTOPs). A prominent example of an MTTOP is a GPGPU; other examples include Intel's MIC architecture and some recent academic designs. MTTOPs differ from CPUs in many significant ways, including their ability to tolerate latency, their memory system organization, and the characteristics of the software they run. We compare implementations of various hardware consistency models for MTTOPs in terms of performance, energy efficiency, hardware complexity, and programmability. Our results show that the choice of hardware consistency model has surprisingly little impact on performance, so the decision should instead be based on hardware complexity, energy efficiency, and programmability. For many MTTOPs, even a simple implementation of sequential consistency is likely to be attractive.
