To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are alternative designations of the same semantic entity. For example, the addresses "Dept. of Comput. Sci. and Eng., University of California, San Diego, 9500 Gilman Dr. Dept. 0114, La Jolla, CA 92093" and "UCSD, Computer Science and Engineering Department, CA 92093-0114" designate the same department. This paper describes three field matching algorithms and evaluates their performance on real-world datasets. One proposed method is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences. Several applications of field matching in knowledge discovery are described briefly, including WEBFIND, a new software tool that discovers scientific papers published on the World Wide Web. WEBFIND uses external information sources to guide its search for authors and papers. Like many other World Wide Web tools, WEBFIND needs to solve the field matching problem in order to navigate between information sources.
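To make the Smith-Waterman idea concrete, the following is a minimal sketch of local alignment scoring applied to text fields. It is an illustration under stated assumptions, not the paper's implementation: the function names, the scoring values (match = 2, mismatch = -1, gap = -1), the simple linear gap penalty, and the normalization by the shorter field are all choices made here for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local-alignment score between strings a and b.

    Scoring values are illustrative assumptions; a linear gap penalty is
    used for simplicity.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below 0, which lets an alignment start anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def normalized_similarity(a: str, b: str, match: int = 2) -> float:
    """Scale the score to [0, 1] by the best score the shorter field could reach.

    Case folding and this normalization are assumptions made for the sketch.
    """
    a, b = a.lower(), b.lower()
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shortest)


if __name__ == "__main__":
    field_1 = "Dept. of Comput. Sci. and Eng., University of California, San Diego"
    field_2 = "UCSD, Computer Science and Engineering Department"
    print(normalized_similarity(field_1, field_2))
    print(normalized_similarity("Comput.", "Computer"))
```

Because local alignment rewards the best-matching substrings and ignores unmatched prefixes and suffixes, an abbreviation such as "Comput" scores highly against "Computer". A caller would typically compare the normalized score against a threshold chosen for the dataset at hand.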