Integration of heterogeneous databases without common domains using queries based on textual similarity

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
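To make the similarity measure concrete, here is a minimal sketch of scoring name constants from two relations with a vector-space (TF-IDF, cosine) model. It is an illustration under stated assumptions, not WHIRL's implementation: the tokenizer, the smoothed weighting, the function names (`tokenize`, `tfidf_vectors`, `top_pairs`), and the sample movie names are all hypothetical. The join shown is the naive all-pairs approach, not the efficient method the abstract refers to.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(name):
    # Assumed tokenizer: lowercase alphanumeric runs (no stemming).
    return re.findall(r"[a-z0-9]+", name.lower())


def tfidf_vectors(names):
    """Map each name to a unit-length TF-IDF vector stored as a dict."""
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    vectors = []
    for doc in docs:
        # Assumed weighting: (1 + log tf) * log(1 + N / df); one common variant.
        vec = {
            term: (1.0 + math.log(tf)) * math.log(1.0 + n_docs / doc_freq[term])
            for term, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def top_pairs(left, right, r):
    """Naive similarity join: score every pair, keep the r best."""
    vectors = tfidf_vectors(left + right)  # term statistics over both columns
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    scored = (
        (i, j, cosine(lv, rv))
        for i, lv in enumerate(left_vecs)
        for j, rv in enumerate(right_vecs)
    )
    return heapq.nlargest(r, scored, key=lambda triple: triple[2])


# Hypothetical name constants from two sources with no shared global domain.
listings = ["The Lost World: Jurassic Park", "Men in Black", "Face/Off"]
reviews = ["Lost World, The (1997)", "Men in Black (1997)", "Face Off"]

for i, j, score in top_pairs(listings, reviews, 3):
    print(f"{score:.3f}  {listings[i]!r} ~ {reviews[j]!r}")
```

Each returned pair carries a score rather than a yes/no match, so answers can be ranked by similarity instead of requiring the names to be normalized into a shared domain first.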
