Linkage of compound objects for supporting maintenance of large-scale web sites

Departments of organizations such as companies and universities tend to publish information on their own Web sites. For example, descriptions of the members of a university laboratory may appear on the laboratory's Web site, the department's Web site, and so on. However, inconsistencies may arise between the descriptions on these sites if they are updated at different times or managed under different policies. Such inconsistencies are hard to find on large-scale Web sites, and the maintenance cost of finding them is high. Record linkage techniques have been developed to determine whether two entities represented as relational records are approximately the same. Current methods focus on simple objects, which are represented by individual records. However, objects often consist of multiple simple objects; that is, they are compound objects. For example, a research team object may contain several researcher objects. In this case, the research team object is a compound object, and the individual researcher objects are simple objects. Current record-level linkage methods cannot link such compound objects correctly when a record of one compound object does not match the corresponding record of the other. We propose novel methods of linking compound objects to support the maintenance of large-scale Web sites. We first extract relational records of Web objects by exploiting the structure of the Web pages on which they appear and the linguistic features of their descriptions. Then, after record-level linkage, we find linkable compound objects composed of simple objects by examining features of the compound objects, namely record continuity, common attribute values, and co-occurrences. Experimental results show that our method can detect compound objects that cannot be detected by record-level linkage alone.
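The two-stage idea described in the abstract (record-level linkage first, then a decision at the level of compound objects) can be illustrated with a minimal Python sketch. This is not the authors' method: the similarity measure, the thresholds, and the helper names `record_similarity`, `link_records`, and `compound_score` are all assumptions made for illustration, and the compound-level score uses only the share of linked member records as a simplified stand-in for the features named in the abstract.

```python
from difflib import SequenceMatcher


def record_similarity(a: dict, b: dict) -> float:
    """Average string similarity over the attributes both records share."""
    shared = [key for key in a if key in b]
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[key]).lower(), str(b[key]).lower()).ratio()
        for key in shared
    ]
    return sum(scores) / len(scores)


def link_records(xs: list, ys: list, threshold: float = 0.85) -> list:
    """Greedy one-to-one record-level linkage; returns (i, j) index pairs."""
    candidates = sorted(
        (
            (record_similarity(x, y), i, j)
            for i, x in enumerate(xs)
            for j, y in enumerate(ys)
        ),
        reverse=True,
    )
    used_i, used_j, links = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i not in used_i and j not in used_j:
            links.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return links


def compound_score(xs: list, ys: list, threshold: float = 0.85) -> float:
    """Jaccard-style share of member records that were linked (an assumption)."""
    if not xs and not ys:
        return 0.0
    matched = len(link_records(xs, ys, threshold))
    return matched / (len(xs) + len(ys) - matched)


# Hypothetical data: the same research team as listed on two different sites.
lab_site = [
    {"name": "Alice Tanaka", "role": "professor"},
    {"name": "Bob Sato", "role": "student"},
    {"name": "Carol Ito", "role": "student"},
]
dept_site = [
    {"name": "Alice Tanaka", "role": "professor"},
    {"name": "Bob Sato", "role": "student"},
    {"name": "Dan Mori", "role": "postdoc"},
]

print(sorted(link_records(lab_site, dept_site)))  # record-level links
print(compound_score(lab_site, dept_site))        # compound-level score
```

In this hypothetical example, one member record on each site has no counterpart, so a purely record-level comparison leaves those records unlinked. Scoring the team as a whole still allows the two team descriptions to be linked, provided the compound-level threshold (another assumption) is set at or below the resulting score.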
