Searching for the sorting record: experiences in tuning NOW-Sort

We present our experiences in developing and tuning the performance of NOW-Sort, a parallel, disk-to-disk sorting algorithm. NOW-Sort currently holds two world records in database-industry standard benchmarks. Critical to the tuning process was the setting of expectations, which tell the programmer both where to tune and when to stop. We found three categories of useful tools: tools that help set expectations and configure the application to different hardware parameters, visualization tools that animate performance counters, and search tools that track down performance anomalies. All such tools must interact well with all layers of the underlying software (e.g., the operating system), as well as with applications that leverage modern OS features, such as threads and memory-mapped I/O.
