Paper information - An efficient approach for data-duplication detection based on RDBMS

Data duplication is one of the most important issues in information system management. Instead of storing a single real-world object as one entity, an information system may store several entities that represent the same object. This duplication can degrade the quality of service of the information system. In this paper, we propose an efficient approach to detecting duplication that is built on the RDBMS itself. The approach assumes that the data to be processed are already stored in the RDBMS, so it does not require the data to be imported into or exported from the storage. It also benefits from the query optimizer of the RDBMS. Experimental results on the TPC-H dataset are presented to validate the proposed approach.
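
To make the idea of in-database detection concrete, the following is a minimal sketch and not the authors' method. It assumes a hypothetical `customer` table with `id` and `name` columns, uses SQLite as a stand-in for the RDBMS, and swaps in exact matching on a normalized name in place of whatever similarity measure the paper actually uses. The point it illustrates is only that the comparison runs as one SQL query inside the engine, so no rows are exported for external matching and the query planner decides how to execute it.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical customer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customer (id, name) VALUES (?, ?)",
        [
            (1, "Alice Smith"),
            (2, "alice smith "),   # same person, different case and spacing
            (3, "Bob Jones"),
            (4, "ALICE SMITH"),
        ],
    )
    return conn


def find_duplicate_groups(conn):
    """Return (normalized_name, sorted_ids, count) for names stored more than once.

    Assumption: two rows are duplicates when their trimmed, lower-cased
    names are identical. This is a simplification chosen for illustration.
    """
    query = """
        SELECT LOWER(TRIM(name)) AS norm_name,
               GROUP_CONCAT(id)  AS ids,
               COUNT(*)          AS n
        FROM customer
        GROUP BY LOWER(TRIM(name))
        HAVING COUNT(*) > 1
        ORDER BY norm_name
    """
    groups = []
    for norm_name, ids, n in conn.execute(query):
        # GROUP_CONCAT does not guarantee an order, so sort the ids here.
        groups.append((norm_name, sorted(int(i) for i in ids.split(",")), n))
    return groups


if __name__ == "__main__":
    for group in find_duplicate_groups(build_demo_db()):
        print(group)
```

Approximate matching, for example on q-grams, would replace the `GROUP BY` key with a join condition, but the overall shape stays the same: the data stay in the database and the detection is expressed as a query.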

Juggapong Natwichai | Kiettisak Chanhom
