Paper information - Duplicate record elimination in large data files

Duplicate record elimination in large data files

The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
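The contrast drawn in the abstract can be illustrated with a small in-memory sketch. The abstract does not spell out the paper's exact procedure, so the code below rests on an assumption: the modification to merge-sort consists of discarding duplicates as soon as two runs are merged, so that intermediate runs shrink as the sort proceeds. The function names `merge_unique`, `dedup_merge_sort`, and `sort_then_scan` are hypothetical and chosen for illustration only. An external, file-based version would work on disk-resident runs rather than Python lists.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def merge_unique(left: Sequence[T], right: Sequence[T]) -> List[T]:
    """Merge two sorted, duplicate-free runs into one sorted, duplicate-free run."""
    out: List[T] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i])
            i += 1
        elif right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            # Same record in both runs: keep one copy, advance both cursors.
            out.append(left[i])
            i += 1
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def dedup_merge_sort(records: Sequence[T]) -> List[T]:
    """Bottom-up merge sort that drops duplicates at every merge step (assumed variant)."""
    runs: List[List[T]] = [[r] for r in records]  # single-record runs are duplicate-free
    if not runs:
        return []
    while len(runs) > 1:
        merged = [merge_unique(runs[k], runs[k + 1]) for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])  # odd run out is carried to the next pass
        runs = merged
    return runs[0]


def sort_then_scan(records: Sequence[T]) -> List[T]:
    """Baseline from the abstract: full sort, then one sequential pass to drop duplicates."""
    out: List[T] = []
    for r in sorted(records):
        if not out or out[-1] != r:
            out.append(r)
    return out
```

Both functions return the same result; they differ in how much data the later passes handle. In the baseline every pass processes all records, while in the sketch each merged run holds only distinct records, so files with many repeated records produce much shorter intermediate runs.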

David J. DeWitt | Dina Bitton

[1] Edward Babb, et al. Implementing a relational database by means of specialized hardware, 1979, TODS.

[2] Donald Ervin Knuth, et al. The Art of Computer Programming, 1968.

[3] J. Ian Munro, et al. Sorting and Searching in Multisets, 1976, SIAM J. Comput.

[4] Irving L. Traiger, et al. System R: relational approach to database management, 1976, TODS.