On sorting strings in external memory (extended abstract)

In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting $K$ strings of total length $N$ is $\Theta(K \log_2 K + N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size $M$ and the block transfer size is $B$, it would be natural to guess that the I/O complexity of sorting strings is $\Theta\left(\frac{K}{B} \log_{M/B} \frac{K}{B} + \frac{N}{B}\right)$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is $\Theta\left(\frac{N_1}{B} \log_{M/B} \frac{N_1}{B} + K_2 \log_{M/B} K_2 + \frac{N}{B}\right)$, where $N_1$ is the total length of all strings shorter than $B$ and $K_2$ is the number of strings longer than $B$. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.

* Department of Computer Science, Duke University, Durham, NC 27708-0129, USA. Email: large@cs.duke.edu. Supported in part by the U.S. Army Research Office under grant DAAH04-96-1-0013 and by the ESPRIT Long Term Research Programme under project 20244 (ALCOM-IT). Part of this work was done while at BRICS, Dept. of Computer Science, University of Aarhus, Denmark, and while visiting Università di Firenze.
† Dipartimento di Informatica, Università di Pisa, Pisa, Italy. Email: ferragin@di.unipi.it. Supported in part by MURST of Italy.
‡ Dipartimento di Sistemi e Informatica, Università di Firenze, Firenze, Italy. Email: grossi@dsi2.dsi.unifi.it. Part of this work was done while visiting BRICS, University of Aarhus, Denmark.
§ Department of Computer Science, Duke University, Durham, NC 27708-0129, USA. Email: jsv@cs.duke.edu. Supported in part by the U.S. Army Research Office under grants DAAH04-93-G-0076 and DAAH04-96-1-0013 and by the National Science Foundation under grant CCR-9522047.
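
To make the dependence on string length relative to the block size concrete, the following Python sketch evaluates the expressions inside the bounds stated in the abstract for a given input. It is an illustration only, not part of the paper: the function names are hypothetical, constant factors are ignored, and clamping each logarithm to at least 1 is an assumption made here so that small arguments do not produce negative terms.

```python
import math


def log_mb(x, M, B):
    """log base M/B of x, clamped to at least 1 (assumed convention)."""
    return max(1.0, math.log(x, M / B)) if x > 1 else 1.0


def ram_expression(K, N):
    """K log2 K + N: the RAM comparison-model expression from the abstract."""
    return K * math.log2(K) + N


def guessed_io_expression(K, N, M, B):
    """(K/B) log_{M/B}(K/B) + N/B: the 'natural guess' from the abstract."""
    return (K / B) * log_mb(K / B, M, B) + N / B


def simple_model_io_expression(lengths, M, B):
    """(N1/B) log_{M/B}(N1/B) + K2 log_{M/B} K2 + N/B for the simple model.

    N1 is the total length of strings shorter than B and K2 is the number
    of strings longer than B, mirroring the wording of the abstract.
    """
    N = sum(lengths)
    N1 = sum(length for length in lengths if length < B)
    K2 = sum(1 for length in lengths if length > B)
    return (N1 / B) * log_mb(N1 / B, M, B) + K2 * log_mb(K2, M, B) + N / B


if __name__ == "__main__":
    M, B = 2**20, 2**10                # hypothetical memory and block sizes
    short = [16] * 2**20               # many strings much shorter than B
    long_ = [2**14] * 2**10            # fewer strings much longer than B
    for name, lengths in (("short", short), ("long", long_)):
        K, N = len(lengths), sum(lengths)
        print(name,
              guessed_io_expression(K, N, M, B),
              simple_model_io_expression(lengths, M, B))
```

Both inputs have the same total length, yet the simple-model expression charges the short strings through the $\frac{N_1}{B} \log_{M/B} \frac{N_1}{B}$ term and the long strings through the $K_2 \log_{M/B} K_2$ term, which is the length dependence the abstract describes.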
