Record linkage for web data

Record linkage refers to the task of finding and linking records (in a single database or in a set of data sources) that refer to the same entity. Automating record linkage is a challenging problem that has been the topic of extensive research for many years. Several tools and techniques have been developed as part of research prototypes and commercial software systems. However, the changing nature of the linkage process and the growing size of data sources create new challenges for this task. In this thesis, we study the record linkage problem for Web data sources. We show that traditional approaches to record linkage fail to meet the needs of Web data because (1) they do not permit users to easily tailor string matching algorithms to the highly heterogeneous and error-riddled string data on the Web and (2) they assume that the attributes required for record linkage are given. We propose novel solutions to address these shortcomings. First, we present a framework for record linkage over relational data, motivated by the fact that many Web data sources are powered by relational database engines. This framework is based on a declarative specification of the linkage requirements by the user and allows linking records in many real-world scenarios. We present algorithms that translate these requirements into queries that can run over a relational data source, potentially using a semantic knowledge base to enhance the accuracy of link discovery. Effective specification of requirements for linking records across multiple data sources requires understanding the schema of each source, identifying attributes that can be used for linkage, and finding their corresponding attributes in other sources. Existing approaches rely on schema or attribute matching, where the goal is to align schemas, so attributes are matched if they play semantically related roles in their schemas.
In contrast, we seek to find attributes that can be used to link records between data sources, which we refer to as linkage points. In this thesis, we define the notion of a linkage point and present the first linkage point discovery algorithms. We then address the novel problem of how to publish Web data in a way that facilitates record linkage. We hypothesize that careful use of existing, curated Web sources (their data and structure) can guide the creation of conceptual models for semistructured Web data that in turn facilitate record linkage with these curated sources. Our solution is an end-to-end framework for data transformation and publication, which includes novel algorithms for identifying linkable entity types and their relationships in semistructured Web data. A highlight of this thesis is the application of the proposed algorithms and frameworks in real applications and the publication of the results as high-quality data sources on the Web.
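
To make the notion of a linkage point more concrete, the following is a minimal illustrative sketch, not the discovery algorithm developed in the thesis. It assumes that a pair of attributes is a candidate linkage point when the normalized value sets of the two attributes overlap strongly, measured here with Jaccard similarity. The function names, the threshold of 0.3, and the sample records are all hypothetical.

```python
from itertools import product


def normalize(value):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(str(value).lower().split())


def value_set(records, attribute):
    """Collect the normalized, non-empty values of one attribute across all records."""
    return {
        normalize(r[attribute])
        for r in records
        if r.get(attribute) not in (None, "")
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_linkage_points(source_a, source_b, threshold=0.3):
    """Score every attribute pair across the two sources by value overlap.

    Assumption for illustration only: a pair is reported as a candidate
    linkage point when its Jaccard score reaches the (hypothetical) threshold.
    """
    attrs_a = sorted({k for r in source_a for k in r})
    attrs_b = sorted({k for r in source_b for k in r})
    scored = []
    for x, y in product(attrs_a, attrs_b):
        score = jaccard(value_set(source_a, x), value_set(source_b, y))
        if score >= threshold:
            scored.append((x, y, score))
    # Highest-scoring pairs first; ties keep their enumeration order.
    return sorted(scored, key=lambda t: t[2], reverse=True)


# Hypothetical sample data: two sources describing films under different schemas.
films = [
    {"title": "Blade Runner", "year": 1982},
    {"title": "Alien", "year": 1979},
    {"title": "Heat", "year": 1995},
]
movies = [
    {"name": "blade  runner", "released": "1982", "studio": "Warner"},
    {"name": "ALIEN", "released": "1979", "studio": "Fox"},
    {"name": "Ronin", "released": "1998", "studio": "MGM"},
]

for attr_a, attr_b, score in candidate_linkage_points(films, movies):
    print(f"{attr_a} <-> {attr_b}: {score:.2f}")
```

In this sketch the pairs are scored purely on the values they contain, not on the attribute names or their roles in the schema, which is the distinction from schema matching drawn above. A realistic setting would likely need approximate string matching and a scalable way to avoid comparing every attribute pair exhaustively.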