Improving server software support for simultaneous multithreaded processors

Simultaneous multithreading (SMT) represents a fundamental shift in processor capability. SMT's ability to execute multiple threads simultaneously within a single CPU offers tremendous potential performance benefits. However, the structure and behavior of software affect the extent to which this potential can be achieved. Consequently, just like the earlier arrival of multiprocessors, the advent of SMT processors prompts a re-evaluation of the software that will run on them. This evaluation is complicated, since SMT adopts architectural features and operating costs of both its predecessors (uniprocessors and multiprocessors). The crucial task for researchers is to determine which software structures and policies (multiprocessor, uniprocessor, or neither) are most appropriate for SMT.

This paper evaluates how SMT's changes to the underlying hardware affect server software, and in particular, SMT's effects on memory allocation and synchronization. Using detailed simulation of an SMT server implemented in three different thread models, we find that the default policies often provided with multiprocessor operating systems produce unacceptably low performance. For each area that we examine, we identify better policies that combine techniques from both uniprocessors and multiprocessors. We also uncover a vital aspect of multithreaded synchronization (its interaction with operating system thread scheduling) that previous research on SMT synchronization had overlooked. Overall, our results demonstrate how a few simple changes to applications' run-time support libraries can dramatically boost the performance of multithreaded servers on SMT, without requiring modifications to the applications themselves.
