The hardware trend of the last 15 years of dynamically trying to improve performance with little software visibility is not only irrelevant today, it's counterproductive; adaptivity must be at the software level if parallel software is going to be portable, fast, and energy-efficient. A portable parallel program is an oxymoron today: there is no reason to be parallel if it's slow, and parallel code can't be fast if it's portable. Hence, portable parallel programs of the future must be able to understand and measure /any/ computer on which they run so that they can adapt effectively, which suggests that hardware measurement should be standardized and that processor performance and energy consumption should become transparent.
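As a concrete illustration of what such self-measurement looks like today, the sketch below reads one hardware counter through Linux's perf_event_open interface. It is a minimal, Linux-specific example: event availability and counter accuracy vary by kernel and processor [1], which is precisely the portability problem a standard would address.

```c
/* Minimal sketch: a program measuring its own instruction count via
 * the Linux perf_event interface (Linux-specific; assumes the kernel
 * and CPU expose this event). */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;                   /* count user code only */

    /* There is no glibc wrapper, so invoke the raw syscall. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                   /* the "work" being measured */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```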
In addition to software-controlled adaptivity for execution efficiency using techniques like autotuning and dynamic scheduling, modern software environments adapt to improve /programmer/ efficiency [1]. Classic examples include dynamic linking, dynamic memory allocation, garbage collection, interpreters, just-in-time compilers, and debugger support. More recent examples include selective embedded just-in-time specialization (SEJITS) [2] for highly productive languages like Python and Ruby. Thus, the future of programming is likely to involve program generators at many levels of the hierarchy tailoring the application to the machine. These productivity advances via adaptivity should be reflected in modern benchmarks: virtually no one writes the statically linked C programs, compiled at the highest optimization level, that are the foundation of most benchmark suites.
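To make the autotuning idea concrete, here is a toy sketch, far simpler than real autotuners or SEJITS: at startup it times two variants of the same kernel on the machine at hand and dispatches all later calls to whichever is faster. The variants and timing method here are illustrative choices, not part of any cited system.

```c
/* Toy autotuning sketch: pick the faster of two kernel variants on
 * this machine at startup. Real autotuners search far larger spaces
 * (blocking, vectorization, threading), but the principle is the same. */
#include <stdio.h>
#include <time.h>

#define N (1 << 20)
static double a[N];

/* Variant 1: straightforward summation loop. */
static double sum_simple(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}

/* Variant 2: four partial sums to expose instruction-level parallelism. */
static double sum_unrolled(void) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

/* Time one variant with a monotonic clock; returns seconds. */
static double time_variant(double (*f)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile double r = f();   /* volatile keeps the call from being elided */
    (void)r;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = i * 1e-6;
    double (*best)(void) =
        (time_variant(sum_simple) <= time_variant(sum_unrolled))
            ? sum_simple : sum_unrolled;
    printf("sum = %f\n", best());   /* all later calls use the tuned choice */
    return 0;
}
```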
The dream is to improve productivity without sacrificing too much performance. Indeed, how often have you heard the claim that a new productive environment is now "almost as fast as C" or "almost as fast as Java"? The implication of this necessary tie between productivity and performance in the manycore era is that modern environments must be able to exploit manycore hardware well, or the gap between highly efficient code and highly productive code will grow with the number of cores.
For industry's bet on manycore to win, therefore, both very high-level and very low-level programming environments will need to be able to understand and measure their underlying hardware and adapt their execution so as to be portable, relatively fast, and energy-efficient.
Hence, we argue that a standard of accurate hardware operation trackers (SHOT) would have a huge positive impact on making parallel software portable with good performance and energy efficiency, similar to the impact the IEEE 754 standard had on the portability of numerical software. In particular, we believe SHOT will lead to much larger improvements in the portability, performance, and energy efficiency of parallel code than recent architectural fads like opportunistic "turbo modes," transactional memory, or reconfigurable computing.
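What a standard tracker interface might look like is an open question; the header below is a purely hypothetical sketch (the names shot_counter_t and shot_read are invented for illustration, and no such standard header exists), meant only to suggest the shape of a portable contract: a small set of counters, defined identically on every conforming processor, readable by user code with documented accuracy.

```c
/* Hypothetical sketch of a SHOT-style interface. All names are
 * invented for illustration; the point is the contract, not the API. */
#include <stdint.h>

typedef enum {
    SHOT_OPS_RETIRED,      /* instructions or operations completed */
    SHOT_FLOPS,            /* floating-point operations */
    SHOT_BYTES_DRAM,       /* bytes moved to or from main memory */
    SHOT_ENERGY_UJ         /* energy consumed, in microjoules */
} shot_counter_t;

/* Read a counter for the calling thread; returns 0 on success.
 * A conforming implementation would document each counter's accuracy. */
int shot_read(shot_counter_t which, uint64_t *value);
```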
[1] Patricia J. Teller et al., "Just how accurate are performance counters?" Proceedings of the 2001 IEEE International Performance, Computing, and Communications Conference, 2001.
[2] James Demmel et al., "the Parallel Computing Landscape," 2022.
[3] David A. Patterson et al., "A case for FAME: FPGA architecture model execution," ISCA, 2010.
[4] Kunle Olukotun et al., "Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford," IEEE Micro, 2010.
[5] John Shalf et al., "SEJITS: Getting Productivity and Performance With Selective Embedded JIT Specialization," 2010.