With abundant transistors but limited energy budgets, chip designs have trended towards multiple cores and specialised logic that is used infrequently. This approach allows computer architects to sidestep the utilisation wall: we can now place more transistors on a chip than we can power simultaneously, which suggests that transistors should be specialised so that only a small fraction of the chip need be active at any one time. However, as trade-offs continue to shift, this approach will become less effective. Increasing heterogeneity increases complexity, making it harder to validate the chip's design, harder to generate optimised code, and harder to protect against hardware faults. Furthermore, beyond 28nm we can no longer assume that smaller transistors will always be cheaper, so we cannot continue to provide dedicated logic that sits idle most of the time.

Instead, we propose switching to a homogeneous approach and implementing the necessary specialisation in software. Having a single computation unit which is repeated many times reduces complexity, making the problems of validation, compilation and fault tolerance easier to solve. Homogeneous systems have the additional advantage of being general-purpose, so a wider range of applications can be usefully accelerated.

The challenge then becomes: how do we make use of all the available processors? A thread-based approach will only get us so far: thread-level parallelism (TLP) is abundant in only a small fraction of code, and TLP in general applications has remained stubbornly low [3]. Instead, we show that if communication between cores is low-latency and low-energy, large numbers of them can be grouped together at run-time to implement a virtual architecture optimised for a particular application. This virtual architecture can be given the ideal cache capacity, communication structure and number of functional units to execute a task efficiently. Since the underlying architecture is homogeneous, there is also scope for dynamically varying the resources allocated, depending on circumstances such as contention, priority and power budget.
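The idea of grouping identical cores at run-time into an application-specific virtual architecture can be sketched as follows. This is a minimal illustrative model, not the paper's actual mechanism: the `Core`, `VirtualArchitecture` and `build_virtual_arch` names, and the choice of "compute" and "cache" roles, are assumptions introduced for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    """One instance of the repeated homogeneous compute unit."""
    cid: int
    role: str = "idle"   # at run time a core may be assigned a software role

@dataclass
class VirtualArchitecture:
    """A run-time grouping of physical cores, specialised in software."""
    cores: list = field(default_factory=list)

    def counts(self):
        """Tally how many cores serve each role in this grouping."""
        tally = {}
        for c in self.cores:
            tally[c.role] = tally.get(c.role, 0) + 1
        return tally

def build_virtual_arch(pool, n_compute, n_cache):
    """Claim idle cores from the shared pool and assign software roles."""
    free = [c for c in pool if c.role == "idle"]
    if len(free) < n_compute + n_cache:
        raise RuntimeError("not enough idle cores for requested configuration")
    va = VirtualArchitecture()
    for c in free[:n_compute]:
        c.role = "compute"
        va.cores.append(c)
    for c in free[n_compute:n_compute + n_cache]:
        c.role = "cache"  # a core's local memory repurposed as extra cache capacity
        va.cores.append(c)
    return va

# Sixteen identical cores; six are grouped into one virtual architecture,
# leaving the rest free for other tasks or a later re-allocation.
pool = [Core(i) for i in range(16)]
va = build_virtual_arch(pool, n_compute=4, n_cache=2)
print(va.counts())  # {'compute': 4, 'cache': 2}
```

Because every core is interchangeable, the same pool can later be re-partitioned with different role counts, which is what makes the dynamic variation under contention, priority or power budget possible.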
[1] Hafizur Rahaman et al., VLSI Design, 2007.
[2] Seth H. Pugsley et al., "SWEL: Hardware cache coherence protocols to map shared data onto shared caches," 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2010.
[3] Antonia Zhai et al., "Triggered instructions: a control paradigm for spatially-programmed architectures," ISCA, 2013.
[4] Daniel Bates et al., "Exploiting Tightly-Coupled Cores," J. Signal Process. Syst., 2015.
[5] Krisztián Flautner et al., "Evolution of thread-level parallelism in desktop applications," ISCA, 2010.
[6] Xiaochen Zhou et al., "A Case for Software Managed Coherence in Many-core Processors," 2010.
[7] Kees G. W. Goossens et al., "Avoiding Message-Dependent Deadlock in Network-Based Systems on Chip," VLSI Design, 2007.
[8] Huang He, "Architecture Supported Synchronization-Based Cache Coherence Protocol for Many-Core Processors," 2009.
[9] Krste Asanovic et al., "Convergence and scalarization for data-parallel architectures," IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2013.