Web Trace Duplication Detection Based on Context

Data integration is becoming increasingly important with the rapid spread of the internet, and the study of entity traces is an increasingly important part of it. Entity traces are mainly extracted from text fragments. Because web sources are large in scale, strongly autonomous, and highly redundant, the extracted records contain many duplicates. Resolving these duplicates often involves semantic features, so traditional integration methods cannot be applied directly. In this paper, we propose a web trace duplication detection method based on unsupervised learning and context. The method first computes a comparison vector between two records based on their context, then acquires sample data automatically, trains classifiers on that sample data, and finally classifies the records.
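
The four steps named above (comparison vector, automatic sample acquisition, classifier training, classification) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the record fields `entity` and `context`, token-level Jaccard similarity as the comparison measure, the `high` and `low` seed thresholds, and a nearest-centroid rule standing in for whatever classifiers the authors actually train.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def comparison_vector(r1, r2, fields=("entity", "context")):
    """One similarity score per field; the field names are hypothetical."""
    return tuple(jaccard(r1[f], r2[f]) for f in fields)


def select_seeds(vectors, high=0.8, low=0.2):
    """Pick training samples without labels: pairs similar on every field
    become positives, pairs dissimilar on every field become negatives."""
    pos = [v for v in vectors if min(v) >= high]
    neg = [v for v in vectors if max(v) <= low]
    return pos, neg


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def train(pos, neg):
    """Nearest-centroid model: a stand-in for the paper's classifiers."""
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative seed")
    return centroid(pos), centroid(neg)


def is_duplicate(model, v):
    """Classify a comparison vector by its closer centroid."""
    c_pos, c_neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(v, c_pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(v, c_neg))
    return d_pos < d_neg


def detect_duplicates(records):
    """Run all four steps and return index pairs classified as duplicates."""
    pairs = list(combinations(range(len(records)), 2))
    vectors = {p: comparison_vector(records[p[0]], records[p[1]]) for p in pairs}
    model = train(*select_seeds(list(vectors.values())))
    return {p for p, v in vectors.items() if is_duplicate(model, v)}


# Made-up example records, for illustration only.
records = [
    {"entity": "Zhang Wei", "context": "joined Northeastern University as professor in 2005"},
    {"entity": "Zhang Wei", "context": "joined Northeastern University as professor in 2005"},
    {"entity": "Li Na", "context": "won the tennis open final yesterday"},
    {"entity": "Zhang Wei", "context": "professor joined Northeastern University in 2005 Shenyang"},
]

duplicates = detect_duplicates(records)
```

In this toy data the first two records seed the positive class, the pairs involving the third record seed the negative class, and the fourth record, whose context is only partly shared, is left for the trained model to decide. A real system would likely replace the Jaccard measure with a context-aware semantic similarity and the centroid rule with a stronger classifier.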
