Paper information - Identification and Tracing of Ambiguous Names: Discriminative and Generative Approaches

A given entity, representing a person, a location, or an organization, may be mentioned in text in multiple, ambiguous ways. Understanding natural language requires identifying whether different mentions of a name, within and across documents, represent the same entity. We present two machine learning approaches to this problem, which we call the "Robust Reading" problem. Our first approach is a discriminative approach, trained in a supervised way. Our second approach is a generative model, at the heart of which is a view on how documents are generated and how names (of different entity types) are "sprinkled" into them. In its most general form, our model assumes: (1) a joint distribution over entities (e.g., a document that mentions "President Kennedy" is more likely to mention "Oswald" or "White House" than "Roger Clemens"), (2) an "author" model that assumes that at least one mention of an entity in a document is easily identifiable, and then generates other mentions via (3) an appearance model, governing how mentions are transformed from the "representative" mention. We show that both approaches perform very accurately, in the range of 90%-95% F1 measure for different entity types, much better than previous approaches to (some aspects of) this problem. Our extensive experiments exhibit the contribution of relational and structural features and, somewhat surprisingly, show that the assumptions made within our generative model are strong enough to yield a very powerful approach that performs better than a supervised approach with limited supervised information.
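The three-part generative story in the abstract can be pictured with a toy simulation. The sketch below is only an illustration under stated assumptions, not the paper's model: the entity groups, the uniform choice over groups (standing in for the joint distribution over entities), and the name-shortening rules in `transform` are all hypothetical.

```python
import random

# Hypothetical toy data: groups of entities assumed to co-occur.
# A uniform choice over groups stands in for the joint distribution (step 1).
ENTITY_GROUPS = [
    ["John F. Kennedy", "Lee Harvey Oswald", "White House"],
    ["Roger Clemens", "New York Yankees"],
]


def transform(representative, rng):
    """Step 3, a toy appearance model: keep the name or shorten it."""
    tokens = representative.split()
    choices = [representative, tokens[-1]]
    if len(tokens) > 1:
        # Initial of the first token plus the last token, e.g. "J. Kennedy".
        choices.append(tokens[0][0] + ". " + tokens[-1])
    return rng.choice(choices)


def generate_document(rng, mentions_per_entity=3):
    """Return (entity, mention) pairs for one simulated document."""
    entities = rng.choice(ENTITY_GROUPS)  # step 1: pick co-occurring entities
    doc = []
    for entity in entities:
        # Step 2, toy "author" model: one mention is the easily
        # identifiable representative, here simply the full name.
        doc.append((entity, entity))
        # Step 3: the remaining mentions are transformed from it.
        for _ in range(mentions_per_entity - 1):
            doc.append((entity, transform(entity, rng)))
    return doc


if __name__ == "__main__":
    for entity, mention in generate_document(random.Random(0)):
        print(f"{mention!r} -> {entity}")
```

Robust reading is the inverse task: given only the mentions, recover which entity each one refers to. In this toy setup the answer is known by construction, which is what makes a simulation like this useful for sanity-checking a resolver.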
