Performance Characterization of the 64-bit x86 Architecture from Compiler Optimizations' Perspective

Intel Extended Memory 64 Technology (EM64T) and the AMD 64-bit architecture (AMD64) are emerging 64-bit x86 architectures that are fully x86 compatible. Compared with the 32-bit x86 architecture, the 64-bit x86 architectures offer applications several new features: applications can address a 64-bit virtual memory space, perform operations on 64-bit-wide operands, access 16 general-purpose registers (GPRs) and 16 extended multi-media (XMM) registers, and use a register-based argument passing convention. In this paper, we investigate the performance impact of these new features from the standpoint of compiler optimizations. Our research compiler is based on the Intel Fortran/C++ production compiler, and our experiments are conducted on the SPEC2000 benchmark suite. Results show that with 64-bit-wide pointer and long data types, several SPEC2000 C benchmarks slow down by more than 20%, mainly because of the enlarged memory footprint. To evaluate the performance potential of 64-bit x86 architectures, we designed and implemented the LP32 code model, in which pointer and long are 32 bits wide. Our experiments demonstrate that, on average, the LP32 code model speeds up the SPEC2000 C benchmarks by 13.4%. For the register-based argument passing convention, our experiments show a performance gain of less than 1%, because of aggressive function inlining. Finally, we observe that using 16 GPRs and 16 XMM registers significantly outperforms using only 8 GPRs and 8 XMM registers. However, our results also show that using 12 GPRs and 12 XMM registers achieves performance competitive with using all 16 GPRs and 16 XMM registers.
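To make the memory-footprint argument concrete, the following minimal sketch models how the size of a pointer-heavy C structure changes between a 64-bit pointer/long model and the LP32 model. The `struct node` layout, the `struct_size` helper, and the natural-alignment rule (each field aligned to its own size) are illustrative assumptions, not taken from the paper or its compiler.

```python
# Hypothetical illustration: field widths in bytes under two data models.
LP64 = {"pointer": 8, "long": 8, "int": 4}   # 64-bit pointer and long
LP32 = {"pointer": 4, "long": 4, "int": 4}   # 32-bit pointer and long


def struct_size(fields, model):
    """Size of a C struct, assuming each field is aligned to its own size."""
    offset = 0
    max_align = 1
    for kind in fields:
        size = model[kind]
        # Pad up to the next multiple of the field's alignment.
        offset = (offset + size - 1) // size * size
        offset += size
        max_align = max(max_align, size)
    # Tail padding so that arrays of the struct stay aligned.
    return (offset + max_align - 1) // max_align * max_align


# Assumed example type: struct node { struct node *next; long key; int value; };
NODE = ["pointer", "long", "int"]

if __name__ == "__main__":
    for name, model in (("LP64", LP64), ("LP32", LP32)):
        print(name, struct_size(NODE, model), "bytes per node")
```

Under these assumptions the node occupies 24 bytes with 64-bit pointer and long and 12 bytes under LP32, so the same cache capacity holds twice as many nodes. This only illustrates the kind of footprint growth the abstract refers to; the measured slowdowns depend on each benchmark's data structures.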
