Duplicate record elimination in large data files
This paper addresses duplicate elimination in large data files in which many occurrences of the same record may appear. It presents a comprehensive cost analysis of the duplicate elimination operation, based on a combinatorial model for estimating the size of the intermediate runs produced by a modified merge-sort procedure that removes duplicates while the file is being sorted. This modified merge-sort is shown to perform significantly better than the standard technique of sorting the file and then making a sequential pass to locate duplicate records. The results can also provide critical input to a query optimizer in a relational database system.
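The paper itself gives no code, but a minimal Python sketch may help illustrate the general idea: duplicates are dropped both when the sorted runs are created and as the runs are merged, so intermediate runs shrink and no final sequential pass is needed. All names here (sorted_runs, merge_unique, run_size) are illustrative assumptions, and comparable in-memory values stand in for disk-resident records; this is a sketch of the technique, not the authors' implementation.

    import heapq
    from itertools import islice

    def sorted_runs(records, run_size):
        """Split the input into memory-sized chunks; sort each chunk and
        drop duplicates within it, yielding shorter intermediate runs.
        (run_size is a hypothetical stand-in for available main memory.)"""
        it = iter(records)
        while True:
            chunk = list(islice(it, run_size))
            if not chunk:
                return
            chunk.sort()
            # Dedupe inside the run: duplicates removed here never reach the merge.
            yield [x for i, x in enumerate(chunk) if i == 0 or x != chunk[i - 1]]

    def merge_unique(runs):
        """k-way merge of the sorted runs, skipping equal records as they
        meet, so no separate sequential de-duplication pass is needed."""
        last = object()  # sentinel distinct from any record
        for rec in heapq.merge(*runs):
            if rec != last:
                yield rec
                last = rec

    if __name__ == "__main__":
        data = [5, 3, 5, 1, 3, 3, 2, 5, 1, 4]
        print(list(merge_unique(sorted_runs(data, run_size=4))))
        # -> [1, 2, 3, 4, 5]

Run-level duplicate removal is what the cost analysis above turns on: the combinatorial model estimates how much shorter the intermediate runs become once each run is deduplicated before the merge phase, which is where the savings over sort-then-scan come from.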