Paper Information - Context-Aware Duplicate Detection in Semi-structured Data Streams

State-of-the-art duplicate detection in semi-structured data achieves significant improvements by exploiting schema-related knowledge. Such schema-bound approaches, however, have severe limitations when dealing with multi-sourced, heterogeneous, high-velocity data streams. In this paper, we propose a novel context-aware duplicate detection system that is workload- and complexity-aware and adapts to the underlying computing platform. The system operates in a schema-oblivious manner and relies on an information-theory-based heuristic and a data shaping technique for efficient and scalable duplicate detection in multi-sourced, heterogeneous data sets. Experiments with real-world data sets show a speedup of up to 8x over state-of-the-art schemes while maintaining up to 92 percent accuracy. In addition, our data shaping technique for GPGPU processing increases duplicate detection throughput by up to two orders of magnitude.

Arun K. Somani | Parijat Shukla
