Reliable and Consistent Data De-duplication In Hierarchical Structure

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, such as XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, we present a novel pruning strategy capable of significant gains over the unoptimized version of the algorithm. Through experiments, we show that our algorithm achieves high precision and recall scores on several data sets.

Keywords: Duplicate Detection, Data Cleaning, XML, Bayesian Network, Network Pruning
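
To make the idea summarized above concrete, the following Python fragment is a minimal illustrative sketch, not the XMLDup algorithm itself. It assumes a much simpler model than a Bayesian network: the score of two elements is the product of their text similarity and the scores of their children paired by tag, and evaluation is abandoned once the running product (an upper bound on the final score) falls below a threshold. The names `text_similarity` and `duplicate_score`, the 0.5 penalty for a missing child, and the sample movie records are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1]; two empty values count as identical."""
    a, b = (a or "").strip(), (b or "").strip()
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def duplicate_score(x, y, threshold=0.0):
    """Score in [0, 1] that elements x and y describe the same object.

    Simplified stand-in for a probabilistic model: the score is the product
    of the elements' own text similarity and the scores of their children,
    paired by tag. Every factor is at most 1, so the running product is an
    upper bound on the final score; once it drops below `threshold` the
    remaining children are skipped (pruning) and 0.0 is returned.
    """
    if x.tag != y.tag:
        return 0.0
    score = text_similarity(x.text, y.text)
    for tag in sorted({c.tag for c in x} | {c.tag for c in y}):
        if score < threshold:
            return 0.0  # pruned: no remaining factor can raise the product
        xs, ys = x.findall(tag), y.findall(tag)
        if not xs or not ys:
            score *= 0.5  # assumed penalty for a child present on one side only
            continue
        # Each child of x takes its best-matching child of y; average the results.
        best = [max(duplicate_score(cx, cy) for cy in ys) for cx in xs]
        score *= sum(best) / len(best)
    return 0.0 if score < threshold else score


# Hypothetical sample records: b is a misspelled copy of a, c is a different movie.
a = ET.fromstring("<movie><title>Pulp Fiction</title><year>1994</year></movie>")
b = ET.fromstring("<movie><title>Pulp Ficton</title><year>1994</year></movie>")
c = ET.fromstring("<movie><title>Alien</title><year>1979</year></movie>")

print(duplicate_score(a, b))                 # high score: likely duplicates
print(duplicate_score(a, c))                 # low score: likely distinct
print(duplicate_score(a, c, threshold=0.5))  # pruned before the year is compared
```

Because both content (text values) and structure (which children exist under which tag) contribute factors, two records only score highly when they agree on both. The threshold gives a cheap early exit for clearly dissimilar pairs, which is the general motivation for pruning in pairwise comparison.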