BHive: A Benchmark Suite and Measurement Framework for Validating x86-64 Basic Block Performance Models

Compilers and performance engineers use hardware performance models to simplify program optimizations. Performance models provide a necessary abstraction over complex modern processors. However, constructing and maintaining a performance model can be onerous, given the numerous microarchitectural optimizations employed by modern processors. Despite their complexity and reported inaccuracy (e.g., deviating from native measurement by more than 30%), existing performance models, such as IACA and llvm-mca, have not been systematically validated, because there is no scalable machine code profiler that can automatically obtain the throughput of arbitrary basic blocks while conforming to common modeling assumptions. In this paper, we present a novel profiler that can profile arbitrary memory-accessing basic blocks without any user intervention. We used this profiler to build BHive, a benchmark for systematic validation of performance models of x86-64 basic blocks. We used BHive to evaluate four existing performance models: IACA, llvm-mca, Ithemal, and OSACA. We automatically cluster basic blocks in the benchmark suite based on their utilization of CPU resources. Using this clustering, our benchmark can give a detailed analysis of a performance model's strengths and weaknesses on different workloads (e.g., vectorized vs. scalar basic blocks). We additionally demonstrate that our dataset well captures basic properties of two Google applications: Spanner and Dremel.
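The validation described in the abstract reduces to comparing a model's predicted basic-block throughputs against measured ones. The sketch below is a minimal, hypothetical illustration (not the paper's released tooling) of two natural comparison metrics: average relative error and Kendall's tau rank correlation [1]. All names and numbers are illustrative assumptions.

```python
# Hypothetical sketch: score a throughput model against measured basic-block
# throughputs. `predicted` and `measured` are illustrative placeholder names.
from scipy.stats import kendalltau

def average_relative_error(predicted, measured):
    """Mean of |pred - meas| / meas over all basic blocks."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

def evaluate_model(predicted, measured):
    tau, _ = kendalltau(predicted, measured)  # agreement in relative ordering
    return {
        "avg_rel_error": average_relative_error(predicted, measured),
        "kendall_tau": tau,
    }

# Example: throughputs (e.g., cycles per iteration of the block run in a loop).
measured  = [1.0, 2.5, 4.0, 1.3]   # from a measurement framework
predicted = [1.1, 2.0, 5.2, 1.3]   # from a model such as IACA or llvm-mca
print(evaluate_model(predicted, measured))
```

Relative error captures absolute accuracy per block, while Kendall's tau captures whether the model ranks blocks in the same order as the hardware, which is often what an optimizing compiler actually needs.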

[1] M. Kendall. A New Measure of Rank Correlation, 1938.

[2] John Whaley, et al. A portable sampling-based profiler for Java virtual machines, 2000, JAVA '00.

[3] Derek Bruening, et al. An infrastructure for adaptive dynamic optimization, 2003, International Symposium on Code Generation and Optimization (CGO 2003).

[4] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[5] Vikram S. Adve, et al. LLVM: a compilation framework for lifelong program analysis & transformation, 2004, International Symposium on Code Generation and Optimization (CGO 2004).

[6] Andrey Gubarev, et al. Dremel: Interactive Analysis of Web-Scale Datasets, 2011.

[7] Christopher Frost, et al. Spanner: Google's Globally-Distributed Database, 2012, OSDI.

[8] M. Pharr, et al. ispc: A SPMD compiler for high-performance CPU programming, 2012, Innovative Parallel Computing (InPar).

[9] Ingo Wald, et al. Embree: a kernel framework for efficient CPU ray tracing, 2014, ACM Trans. Graph.

[10] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[11] Gerhard Wellein, et al. Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures, 2018, IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).

[12] Saman P. Amarasinghe, et al. goSLP: globally optimized superword level parallelism framework, 2018, Proc. ACM Program. Lang.

[13] Jan Reineke, et al. uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures, 2018, ASPLOS.

[14] Ben Juurlink, et al. Portable Cost Modeling for Auto-Vectorizers, 2019, IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).

[15] Michael Carbin, et al. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks, 2018, ICML.