Paper Information - Reliable and Consistent Data De-duplication In Hierarchical Structure

Reliable and Consistent Data De-duplication In Hierarchical Structure

Although there is a long line of work on identifying duplicates in relational data, only a few solutions address duplicate detection in more complex hierarchical structures, such as XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy is presented, capable of significant gains over the unoptimized version of the algorithm. Through experiments, we show that our algorithm is able to achieve high precision and recall scores on several data sets.

Keywords: Duplicate Detection, Data Cleaning, XML, Bayesian Network, Network Pruning
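The idea in the abstract, scoring two XML elements on both their values and their structure and abandoning the comparison early once it cannot succeed, can be illustrated with a minimal sketch. This is not the XMLDup Bayesian network or its pruning rule: the names `text_sim` and `dup_score`, the averaging rule, and the `threshold` parameter are all hypothetical stand-ins chosen for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_sim(a, b):
    """String similarity in [0, 1]; a stand-in for a value-level probability."""
    ta = (a.text or "").strip().lower()
    tb = (b.text or "").strip().lower()
    return SequenceMatcher(None, ta, tb).ratio()


def dup_score(a, b, threshold=0.0):
    """Hypothetical duplicate score for elements a and b (not the paper's formula).

    Leaves are compared by text. Inner nodes average, over all child tags,
    the best pairwise score of their children. Evaluation stops early and
    returns 0.0 once the score can no longer reach `threshold`.
    """
    if a.tag != b.tag:
        return 0.0
    tags = sorted({c.tag for c in a} | {c.tag for c in b})
    if not tags:
        return text_sim(a, b)
    total = 0.0
    for i, tag in enumerate(tags):
        xs, ys = a.findall(tag), b.findall(tag)
        if xs and ys:
            # Each child of a is matched with its best counterpart in b.
            total += sum(max(dup_score(x, y) for y in ys) for x in xs) / len(xs)
        # Pruning (assumed rule): suppose every remaining tag scores 1.0;
        # if even that upper bound is below the threshold, give up.
        remaining = len(tags) - (i + 1)
        if (total + remaining) / len(tags) < threshold:
            return 0.0
    return total / len(tags)


a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>Matrix, The</title><year>1999</year></movie>")
c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")

print(dup_score(a, b), dup_score(a, c), dup_score(a, c, threshold=0.9))
```

The upper-bound check is what makes early termination safe in this sketch: a pair is discarded only when no outcome of the remaining comparisons could lift it above the threshold.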

M. Venugopal Reddy | P. Niranjan Reddy | M. Krishna Kumar

[1] Dmitri V. Kalashnikov, et al. Domain-independent data cleaning via analysis of entity-relationship graph, 2006, TODS.

[2] Judea Pearl, et al. Probabilistic reasoning in intelligent systems - networks of plausible inference, 1991, Morgan Kaufmann series in representation and reasoning.

[3] Felix Naumann, et al. DogmatiX tracks down duplicates in XML, 2005, SIGMOD '05.

[4] Pável Calado, et al. Efficient and Effective Duplicate Detection in Hierarchical Data, 2013, IEEE Transactions on Knowledge and Data Engineering.

[5] Pável Calado, et al. Structure-based inference of XML similarity for fuzzy duplicate detection, 2007, CIKM '07.

[6] Zhonghui Xu, et al. Inherited Feature-based Similarity Measure Based on Large Semantic Hierarchy and Large Text Corpus, 1996, COLING.