Data integration using similarity joins and a word-based information representation language

The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Here we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language “names” for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text. We describe WHIRL, a “soft” database management system which supports “similarity joins,” based on certain robust, general-purpose similarity metrics for text. This enables fragments of text (e.g., informal names of objects) to be used as keys. WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall score. Experiments show that WHIRL is much faster than naive inference methods, even for short queries, and efficient on typical queries to real-world databases with tens of thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
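
To make the idea of a similarity join concrete, the following is a minimal sketch rather than WHIRL's actual implementation. It assumes TF-IDF weighted cosine similarity as the text metric (the abstract says only "robust, general-purpose similarity metrics") and uses a naive all-pairs comparison, which is the kind of baseline the abstract describes WHIRL as outperforming. The function names `tokenize`, `tfidf_vectors`, and `similarity_join`, and the sample company names, are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on any non-alphanumeric character."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict) per input string."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many strings does each token occur?
    df = Counter(t for toks in token_lists for t in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        # Assumed weighting: (1 + log tf) * log((n + 1) / df); always positive.
        vec = {t: (1 + math.log(c)) * math.log((n + 1) / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_join(left, right, top_k=5):
    """Naive all-pairs join: score every (left, right) pair, return the best."""
    # Term statistics are computed over both columns together (an assumption).
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    scored = []
    for i, a in enumerate(left_vecs):
        for j, b in enumerate(right_vecs):
            # Cosine similarity of unit vectors is their dot product.
            score = sum(w * b.get(t, 0.0) for t, w in a.items())
            if score > 0:
                scored.append((score, left[i], right[j]))
    # Ranked answer list: highest-scoring pairings first.
    scored.sort(key=lambda s: -s[0])
    return scored[:top_k]


# Two hypothetical relations that name the same entities differently.
left = ["Kleiser-Walczak Construction Co.", "Acme Corp", "Bell Laboratories"]
right = ["Kleiser Walczak", "ACME Corporation", "AT&T Bell Labs"]

results = similarity_join(left, right)
for score, a, b in results:
    print(f"{score:.3f}  {a!r} ~ {b!r}")
```

Each name is used directly as a join key, and the output is a ranked list of candidate pairings rather than a set of exact matches. The quadratic loop is only workable for small relations; scaling to tens of thousands of tuples would need an index-based search strategy.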
