Improved Unsupervised Name Discrimination with Very Wide Bigrams and Automatic Cluster Stopping

We cast name discrimination as a problem of clustering short contexts. Each occurrence of an ambiguous name is treated independently and represented using a second-order context vector. We calibrate our approach on a manually annotated collection of five ambiguous names from the Web, and then apply the learned parameter settings to three held-out sets of pseudo-name data that have been reported on in previous publications. We find that significant improvements in the accuracy of name discrimination can be achieved by using very wide bigrams, which are ordered pairs of words with up to 48 intervening words between them. We also show that recent developments in automatic cluster stopping can be used to predict the number of underlying identities without any significant loss of accuracy compared to previous approaches, which set this number manually.
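
To make the definition of a very wide bigram concrete, the following is a minimal sketch that counts ordered word pairs separated by at most a given number of intervening words. The function name `wide_bigrams` and the parameter `max_gap` are hypothetical, chosen for illustration. The abstract does not say how tokenization, stop-word handling, or feature selection are done, so none of that is modeled here.

```python
from collections import Counter


def wide_bigrams(tokens, max_gap=48):
    """Count ordered pairs (w1, w2) where w2 follows w1 with at most
    max_gap intervening tokens. max_gap=0 gives ordinary adjacent bigrams.

    Illustrative only: the name and signature are assumptions, not the
    authors' implementation.
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        # The slice covers the next max_gap + 1 positions to the right of i,
        # i.e. partners with 0..max_gap words in between.
        for second in tokens[i + 1 : i + max_gap + 2]:
            counts[(first, second)] += 1
    return counts


tokens = "the senator met the press".split()
adjacent = wide_bigrams(tokens, max_gap=0)  # ordinary bigrams
wide = wide_bigrams(tokens, max_gap=2)      # up to 2 intervening words
```

With `max_gap=0` the pair `("senator", "press")` is never counted, while with `max_gap=2` it is, which shows how widening the window lets more distant co-occurrences contribute features.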
