Context-Aware Duplicate Detection in Semi-structured Data Streams

State-of-the-art duplicate detection in semi-structured data achieves significant improvements by exploiting schema-related knowledge. Such schema-bound approaches, however, have severe limitations when dealing with multi-sourced, heterogeneous, high-velocity data streams. In this paper, we propose a novel context-aware duplicate detection system that is workload- and complexity-aware and adapts to the underlying computing platform. The system operates in a schema-oblivious manner and relies on an information-theory-based heuristic and a data shaping technique for efficient and scalable duplicate detection in multi-sourced, heterogeneous data sets. Experiments with real-world data sets show a speedup of up to 8x over state-of-the-art schemes while maintaining up to 92 percent accuracy. In addition, our data shaping technique for GPGPU processing increases duplicate detection throughput by up to two orders of magnitude.
