Asymmetries in Multi-Core Systems – Or Why We Need Better Performance Measurement Units

Future exascale systems will be built from multi-core processors, but even today’s multi-core processors can be asymmetric, exhibiting limitations and bottlenecks that differ from those of a symmetric multiprocessor. In this paper we investigate the performance of a cluster node based on the Intel Xeon E5345 quad-core processor and find that, despite the symmetry implied by the programming model, the available memory bandwidth is not shared equally among the cores. Consequently, applications experience substantial performance variance and slowdowns when tasks (threads) are mapped to cores naively. An operating system scheduler could mitigate these effects by taking the memory bus structure into account, but it needs accurate information from the performance monitoring unit because the asymmetry is not directly exposed in the processor’s instruction set manual. Current performance monitoring units are inflexible and change from one processor to the next, which discourages higher levels of the software tool chain from using them. The next generation of Nehalem-based multi-core systems poses similar challenges, and portable performance monitoring units will be crucial if applications are to exploit the performance potential of exascale systems. We expect this situation to persist as long as memory is slow relative to the processor.
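The idea of a scheduler that takes the memory bus structure into account can be illustrated with a minimal sketch. It is not the paper's method: the function name `spread_across_buses` and the core-to-bus table `TOPOLOGY` are hypothetical, and a real system would have to derive the topology from measurements or platform documentation. The sketch only shows the placement step, spreading memory-bound threads over buses instead of filling cores in numerical order.

```python
from collections import defaultdict
from itertools import zip_longest


def spread_across_buses(threads, core_to_bus):
    """Give each thread its own core, cycling over buses so that threads
    share a bus only once every bus already hosts a thread."""
    if len(threads) > len(core_to_bus):
        raise ValueError("more threads than cores")

    # Group core ids by the bus they are attached to.
    cores_by_bus = defaultdict(list)
    for core, bus in sorted(core_to_bus.items()):
        cores_by_bus[bus].append(core)

    # Interleave: first core of every bus, then second core of every bus, ...
    per_bus_lists = (cores_by_bus[bus] for bus in sorted(cores_by_bus))
    order = [core
             for tier in zip_longest(*per_bus_lists)
             for core in tier
             if core is not None]

    return dict(zip(threads, order))


# Hypothetical topology (an assumption, not taken from the paper):
# eight cores, cores 0-3 on bus 0 and cores 4-7 on bus 1.
TOPOLOGY = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

placement = spread_across_buses(["t0", "t1", "t2", "t3"], TOPOLOGY)
print(placement)
```

Under this assumed topology, a naive mapping would place all four threads on cores 0 to 3 and therefore on one bus, whereas the sketch alternates between the two buses. On Linux, such a placement could then be applied per thread with `os.sched_setaffinity`.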
