Analytical performance analysis and modeling of superscalar and multi-threaded processors

[...] we demonstrate with a study of the performance impact of various compiler optimizations. We have also extended the performance counter architecture to multi-threaded processors (SMT processors). The main problem in multi-threaded processors is the mutual performance interference between threads, which makes it difficult to isolate the performance of the individual threads. We therefore developed a mechanism that can reconstruct the cycle component stacks of each individual thread as if it were executing in isolation on a single-threaded processor. This allows it to accurately estimate the progress of each thread, which is useful for system software or hardware that must provide a given quality of service (e.g., a performance guarantee) on multi-threaded processors.

Summary

Due to its complexity, analyzing the performance of a modern superscalar processor is a challenging task. The processor can execute multiple instructions per cycle, and instructions may execute out of order. In addition, various miss events can happen at different stages in the processor pipeline: the fetching of instructions can stall due to instruction cache misses, branch mispredictions cause the fetching of wrong-path instructions, which will eventually be flushed, and data cache misses can drastically delay the execution of memory instructions. Furthermore, miss event handling can be overlapped with instruction execution and/or the handling of other miss events. Multi-threaded processors (e.g., simultaneous multithreading (SMT) processors) are even more difficult to analyze, since the concurrently executing threads are closely interlaced and affect each other's performance.
As such, it is difficult to get an intuitive understanding of the performance of a program executing on a contemporary processor, and to gain insight into how large the performance impact is of the various miss events that can occur within the processor. Therefore, performance evaluation in an experimental context has shifted towards simulation. Simulation has the advantage that it is very flexible and that its overall performance estimates are very accurate. It is, however, very time-consuming: simulating a few seconds of real execution time can take hours or days, even with the fastest simulators running on today's fastest computers. It also provides little insight into the factors that determine overall performance. This lack of insight also makes accurate on-line performance monitoring more difficult. Hardware performance counters can measure various events that have an impact on performance, but they do not quantify the contribution of these events to overall performance. Therefore, the insight they provide into the performance of a program executing on …
