Apple-CORE: Microgrids of SVP Cores -- Flexible, General-Purpose, Fine-Grained Hardware Concurrency Management

To harness the potential of chip multiprocessors (CMPs) for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, aiming at the efficient use of parallelism in general-purpose workloads. The corresponding hardware implementation provides logic able to coordinate single-issue, in-order, multi-threaded RISC cores into on-chip computation clusters called Microgrids. In contrast with the traditional "accelerator" approach, Microgrids are intended to be used as components in distributed systems on chip that treat both clusters of small cores and optional larger cores optimized for sequential performance as system services shared between applications. The key aspects of the design are asynchrony, i.e. the ability to tolerate operations with irregular long latencies; a scale-invariant programming model; a distributed vision of the chip's structure; and the transparent performance scaling of a single program binary across multiple cluster sizes. This paper describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip, and its software environment. The reference chip parameters include 128 cores, a 4 MB on-chip distributed cache network and four DDR3-1600 memory channels. Cycle-accurate simulation results are presented for various key algorithmic and cryptographic kernels. The results show efficient utilization of the hardware despite high-latency memory accesses, and good scalability across relatively large clusters of cores.
