Knows Best: A Case for Hardware Transparency and Measurability

Future gains in computer system performance will only come from increased parallelism and efficient execution on specialized hardware. However, parallel programming is more difficult than sequential programming, which means parallel programming will only be used if it leads to improved performance or energy efficiency. Because portability and reusability are required to reduce software costs, parallel software must become performance-portable to become mainstream. In this paper, we evaluate the state of performance-portability across several current platforms and explore approaches to realizing performance-portability for future applications. We find that appropriate hardware measurements are crucial to all our techniques, but existing detailed microarchitectural performance counters were not designed for use by application software. We provide examples that show how they fail to support the needs of an adaptive parallel software stack, making the creation of performance-portable software nigh unto impossible. We propose SHOT (Standardized Hardware Operation Tracker), which provides a standardized architecture for accessing a few high-level system measurements. We argue that a standardized hardware measurement system will contribute more to the success of the parallel revolution than many other proposed hardware mechanisms by enabling software to adapt to underlying hardware resources.
