The architectural costs of streaming I/O: A comparison of workstations, clusters, and SMPs

We investigate resource usage while performing streaming I/O by contrasting three architectures, a single workstation, a cluster, and an SMP, under various I/O benchmarks. We derive analytical and empirically-based models of resource usage during data transfer, examining the I/O bus, memory bus, network, and processor of each system. By investigating each resource in detail, we assess what comprises a well-balanced system for these workloads. We find that the architectures we study are not well balanced for streaming I/O applications. Across the platforms, the main limitation to attaining peak performance is the CPU, due to lack of data locality. Increasing processor performance (especially with improved block operation performance) will be of great aid for these workloads in the future. For a cluster workstation, the I/O bus is a major system bottleneck, because of the increased load placed on it from network communication. A well-balanced cluster workstation should have copious I/O bus bandwidth, perhaps via multiple I/O busses. The SMP suffers from poor memory-system performance; even when there is true parallelism in the benchmark, contention in the shared-memory system leads to reduced performance. As a result, the clustered workstations provide higher absolute performance for streaming I/O workloads.

[1]  Michael Stonebraker,et al.  A measure of transaction processing power , 1985 .

[2]  Michael Stonebraker,et al.  The Case for Shared Nothing , 1985, HPTS.

[3]  D. Culler,et al.  Active Messages: A Mechanism for Integrated Communication and Computation , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[4]  Josep Torrellas,et al.  Characterizing the caching and synchronization performance of a multiprocessor operating system , 1992, ASPLOS V.

[5]  S. Kleiman,et al.  Symmetric multiprocessing in Solaris 2.0 , 1992, Digest of Papers COMPCON Spring 1992.

[6]  Seth Copen Goldstein,et al.  Active Messages: A Mechanism for Integrated Communication and Computation , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[7]  Brian N. Bershad,et al.  The impact of operating system structure on memory system performance , 1994, SOSP '93.

[8]  James R. Larus,et al.  Cooperative shared memory: software and hardware for scalable multiprocessors , 1993, TOCS.

[9]  C. Kuszmaul Out of Core FFTs in a Parallel Application Environment , 1993 .

[10]  David A. Patterson,et al.  A case for networks of workstations (now) , 1994, Symposium Record Hot Interconnects II.

[11]  David B. Lomet,et al.  AlphaSort: a RISC machine sort , 1994, SIGMOD '94.

[12]  David E. Culler,et al.  A case for NOW (networks of workstation) , 1995, PODC '95.

[13]  Charles L. Seitz,et al.  Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.

[14]  Thorsten von Eicken,et al.  U-Net: a user-level network interface for parallel and distributed computing , 1995, SOSP.

[15]  Sharon E. Perl,et al.  Studies of Windows NT performance using dynamic execution traces , 1996, OSDI '96.

[16]  Ramesh C. Agarwal,et al.  A super scalar sort algorithm for RISC processors , 1996, SIGMOD '96.

[17]  Richard P. Martin,et al.  Assessing Fast Network Interfaces , 1996, IEEE Micro.

[18]  Mark D. Hill,et al.  A case for making network interfaces less peripheral , 1997 .

[19]  Andrea C. Arpaci-Dusseau,et al.  High-performance sorting on networks of workstations , 1997, SIGMOD '97.

[20]  Mark D. Hill,et al.  Making Network Interfaces Less Peripheral , 1998, Computer.