Automatically Utilizing Secondary Sources to Align Information Across Sources

XML, web services, and the semantic web have opened the door to new information-integration applications. Information sources on the web are controlled by different organizations or people, use different text formats, and contain inconsistencies of varying kinds. Any system that integrates information from several data sources must therefore identify the entities those sources have in common. Many data sources on the web do not contain enough information for state-of-the-art record-linkage systems to link their records accurately. However, secondary data sources on the web can be exploited to improve the record-linkage process. We present an approach that accurately and automatically matches entities across data sources by combining a state-of-the-art record-linkage system with a data-integration system. The data-integration system automatically determines which secondary sources need to be queried when linking records from the various data sources. The record-linkage system then uses this additional information to improve the accuracy of the linkage between datasets.
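The idea can be illustrated with a minimal sketch, which is not the system described above. It assumes two toy sources that share only a name field, one hypothetical secondary source that maps an address to a phone number, and a simple rule-based matcher built on Python's `difflib`. All names, data values, and thresholds here are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical secondary sources, each described by the attribute it needs
# as input and the attribute it provides (a stand-in for source descriptions).
SECONDARY_SOURCES = [
    {
        "needs": "address",
        "provides": "phone",
        "lookup": {"12 main st": "555-0100", "98 oak ave": "555-0199"},
    },
]


def enrich(record, needed_attribute):
    """Return a copy of record, filling needed_attribute from a secondary
    source only when the record lacks it and a source can supply it."""
    enriched = dict(record)
    if needed_attribute in enriched:
        return enriched  # nothing to query
    for source in SECONDARY_SOURCES:
        if source["provides"] != needed_attribute or source["needs"] not in enriched:
            continue  # this source is not useful for this record
        value = source["lookup"].get(enriched[source["needs"]].lower())
        if value is not None:
            enriched[needed_attribute] = value
            break
    return enriched


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a, b, strict=0.9, relaxed=0.5):
    """Match on a very similar name alone, or on a moderately similar name
    when both records carry the same phone number (illustrative thresholds)."""
    sim = name_similarity(a["name"], b["name"])
    if sim >= strict:
        return True
    phone_a, phone_b = a.get("phone"), b.get("phone")
    return phone_a is not None and phone_a == phone_b and sim >= relaxed


def link(source_a, source_b, use_secondary):
    """Return (name_a, name_b) pairs judged to refer to the same entity."""
    if use_secondary:
        source_a = [enrich(r, "phone") for r in source_a]
        source_b = [enrich(r, "phone") for r in source_b]
    return [(a["name"], b["name"])
            for a in source_a for b in source_b if is_match(a, b)]


# Toy data: the two sources share only the name attribute.
SOURCE_A = [{"name": "Art's Deli", "address": "12 Main St"}]
SOURCE_B = [
    {"name": "Arts Delicatessen", "phone": "555-0100"},
    {"name": "Oak Avenue Grill", "phone": "555-0199"},
]

without_secondary = link(SOURCE_A, SOURCE_B, use_secondary=False)
with_secondary = link(SOURCE_A, SOURCE_B, use_secondary=True)
```

With these toy values, the names alone are too dissimilar to link the two deli records, and the pair is linked only after the secondary source supplies a shared phone number. A real system would replace the hard-coded lookup with queries planned by the data-integration system and the fixed thresholds with a learned matcher.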
