The compare-and-branch sequences required in a traditional sort algorithm cannot efficiently exploit the multiple execution units present in currently available high-performance RISC processors. This is because of the long latency of the compare instructions and the sequential nature of the algorithms used in sorting. With the increasing level of integration on a chip, this trend is expected to continue. We have developed new sort algorithms that eliminate almost all the compares, provide functional parallelism that can be exploited by multiple execution units, significantly reduce the number of passes through the keys, and improve data locality. These new algorithms outperform traditional sort algorithms by a large factor.

For the Datamation disk-to-disk sort benchmark (one million 100-byte records), Chris Nyberg et al. presented several new performance records at SIGMOD '94 using DEC Alpha processor-based systems.

We have implemented the Datamation sort benchmark using our new sort algorithm on a desktop IBM RS/6000 model 39H (66.6 MHz) with 8 IBM SSA 7133 disk drives (total cost $73K). The total elapsed time for the 100 MB sort was 5.1 seconds (vs. the old uniprocessor record of 9.1 seconds). We have also established a new price-performance record (0.2¢ vs. the old record of 0.9¢, as the cost of the sort). The entire sort processing was overlapped with I/O. During the read phase we achieved a sustained bandwidth of 47 MB/sec, and during the write phase a sustained bandwidth of 39 MB/sec. Key extraction and sorting of one million 10-byte keys took only 0.6 seconds of CPU time. The rest of the CPU time was used in moving records, servicing I/O, and other overheads.

Algorithmic details leading to this level of performance are described in this paper. A detailed analysis of the CPU time spent during various phases of the sort algorithm and I/O is also provided.
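To make the idea of sorting without key-to-key compares concrete, the following is a minimal sketch of a generic byte-wise distribution (LSD radix) sort. It is an illustration swapped in for this purpose and is not the algorithm developed in the paper, whose details the abstract does not give. The function name `lsd_radix_sort` and its parameters are hypothetical, and keys are assumed to be fixed-length byte strings, as in the benchmark's 10-byte keys.

```python
def lsd_radix_sort(keys, key_len):
    """Sort fixed-length byte-string keys without comparing keys to each other.

    Illustrative sketch only: a textbook least-significant-byte-first radix
    sort, not the algorithm described in the paper.
    """
    for pos in range(key_len - 1, -1, -1):
        # Count how many keys carry each byte value at this position.
        counts = [0] * 256
        for k in keys:
            counts[k[pos]] += 1

        # Prefix sums turn the counts into the starting offset of each bucket.
        offsets = [0] * 256
        total = 0
        for b in range(256):
            offsets[b] = total
            total += counts[b]

        # Stable scatter: each key goes to the next free slot of its bucket.
        out = [None] * len(keys)
        for k in keys:
            b = k[pos]
            out[offsets[b]] = k
            offsets[b] += 1
        keys = out
    return keys
```

Each pass indexes a table by a key byte instead of branching on a comparison between two keys, which is the general property the abstract points to. A call such as `lsd_radix_sort([b"ba", b"ab", b"aa"], 2)` returns the keys in lexicographic byte order.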