Semantic Integration in Text: From Ambiguous Names to Identifiable Entities

Semantic integration focuses on discovering, representing, and manipulating correspondences between entities in disparate data sources. The topic has been widely studied in the context of structured data, with problems being considered including ontology and schema matching, matching relational tuples, and reconciling inconsistent data values. In recent years, however, semantic integration over text has also received increasing attention. This article studies a key challenge in semantic integration over text: identifying whether different mentions of real-world entities, such as "JFK" and "John Kennedy," within and across natural language text documents, actually represent the same concept.We present a machine-learning study of this problem. The first approach is a discriminative approach--a pairwise local classifier is trained in a supervised way to determine whether two given mentions represent the same real-world entity. This is followed, potentially, by a global clustering algorithm that uses the classifier as its similarity metric. Our second approach is a global generative model, at the heart of which is a view on how documents are generated and how names (of different entity types) are "sprinkled" into them. In its most general form, our model assumes (1) a joint distribution over entities (for example, a document that mentions "President Kennedy" is more likely to mention "Oswald" or "White House" than "Roger Clemens"), and (2) an "author" model that assumes that at least one mention of an entity in a document is easily identifiable and then generates other mentions via (3) an "appearance" model that governs how mentions are transformed from the "representative" mention. We show that both approaches perform very accurately, in the range of 90-95 percent. F1 measure for different entity types, much better than previous approaches to some aspects of this problem. Finally, we discuss how our solution for mention matching in text can be potentially applied to matching relational tuples, as well as to linking entities across databases and text.

[1]  Andrew Kehler,et al.  Coherence, reference, and the theory of grammar , 2002, CSLI lecture notes series.

[2]  Erhard Rahm,et al.  A survey of approaches to automatic schema matching , 2001, The VLDB Journal.

[3]  Dan Roth,et al.  Question-Answering via Enhanced Understanding of Questions , 2002, TREC.

[4]  Jiawei Han,et al.  Profile-Based Object Matching for Information Integration , 2003, IEEE Intell. Syst..

[5]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[6]  Sanda M. Harabagiu,et al.  Performance issues and error analysis in an open-domain question answering system , 2003, TOIS.

[7]  William W. Cohen,et al.  Learning to match and cluster large high-dimensional data sets for data integration , 2002, KDD.

[8]  Renée J. Miller,et al.  Information-theoretic tools for mining database structure from large data sets , 2004, SIGMOD '04.

[9]  Andrew McCallum,et al.  Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference , 2003, IIWeb.

[10]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[11]  Breck Baldwin,et al.  Entity-Based Cross-Document Coreferencing Using the Vector Space Model , 1998, COLING.

[12]  David Yarowsky,et al.  Unsupervised Personal Name Disambiguation , 2003, CoNLL.

[13]  Claire Gardent,et al.  Improving Machine Learning Approaches to Coreference Resolution , 2002, ACL.

[14]  Pedro M. Domingos,et al.  iMAP: discovering complex semantic matches between database schemas , 2004, SIGMOD '04.

[15]  Amit P. Sheth,et al.  Semantic interoperability in global information systems , 1999, SGMD.

[16]  Hwee Tou Ng,et al.  A Machine Learning Approach to Coreference Resolution of Noun Phrases , 2001, CL.

[17]  Erhard Rahm,et al.  On Matching Schemas Automatically , 2001 .

[18]  Dan Roth,et al.  Robust Reading: Identification and Tracing of Ambiguous Names , 2004, NAACL.

[19]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[20]  Alon Halevy,et al.  SEMEX: Toward On-the-fly Personal Information Integration , 2004 .

[21]  Erhard Rahm,et al.  Similarity flooding: a versatile graph matching algorithm and its application to schema matching , 2002, Proceedings 18th International Conference on Data Engineering.

[22]  Craig A. Knoblock,et al.  Learning domain-independent string transformation weights for high accuracy object identification , 2002, KDD.

[23]  Felix Naumann,et al.  Attribute classification using feature analysis , 2002, Proceedings 18th International Conference on Data Engineering.

[24]  James Allan,et al.  Cross-Document Coreference on a Large Scale Corpus , 2004, NAACL.

[25]  Pedro M. Domingos,et al.  Reconciling schemas of disparate data sources: a machine-learning approach , 2001, SIGMOD '01.

[26]  Werner Bux,et al.  Performance Issues , 1983, Advanced Course: Local Area Networks.

[27]  William Jones Personal Information Management , 2007, Annu. Rev. Inf. Sci. Technol..

[28]  Erhard Rahm,et al.  Generic Schema Matching with Cupid , 2001, VLDB.

[29]  Andrew McCallum,et al.  Extracting social networks and contact information from email and the Web , 2004, CEAS.

[30]  Dan Roth,et al.  Relational Learning via Propositional Algorithms: An Information Extraction Case Study , 2001, IJCAI.

[31]  Jacek Gwizdka,et al.  Personal information management , 2004, CHI EA '04.

[32]  Tova Milo,et al.  Using Schema Matching to Simplify Heterogeneous Data Translation , 1998, VLDB.

[33]  Dan Roth,et al.  Identification and Tracing of Ambiguous Names: Discriminative and Generative Approaches , 2004, AAAI.

[34]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[35]  Ellen M. Voorhees,et al.  Overview of the TREC 2004 Novelty Track. , 2005 .