Discovering identities in web contexts with unsupervised clustering

We apply unsupervised clustering methods to the problem of discriminating among ambiguous names that occur in short passages of text on Web pages. We show how to tailor these methods to handle the very noisy data typically found on the Web. We experiment with several variations in feature selection, two methods that automatically determine the number of clusters in the data, two different representations of the contexts to be discriminated, and dimensionality reduction. We evaluate on Web contexts for five ambiguous names, which were manually disambiguated to serve as a gold standard.
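To make the task concrete, here is a minimal sketch of name discrimination by clustering contexts. It is an illustration under stated assumptions, not the method evaluated in this work: it assumes plain bag-of-words features, cosine similarity, and average-link agglomerative clustering that stops merging once the best similarity falls below a threshold, so the number of clusters is not fixed in advance. The function names, the stop list, the threshold value, and the example contexts for the name "George Miller" are all hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was",
             "for", "to", "at", "on"}


def vectorize(context, target):
    """Bag-of-words vector for one context, excluding stop words and the
    tokens of the ambiguous name itself (they occur in every context)."""
    name_tokens = set(target.lower().split())
    tokens = [t.strip(".,;:!?\"'()").lower() for t in context.split()]
    return Counter(t for t in tokens
                   if t and t not in STOPWORDS and t not in name_tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster_contexts(contexts, target, threshold=0.2):
    """Average-link agglomerative clustering over context indices.

    Merging stops when the most similar pair of clusters is below
    `threshold`; the threshold is a hypothetical stopping rule chosen
    for this sketch.
    """
    vectors = [vectorize(c, target) for c in contexts]
    clusters = [[i] for i in range(len(contexts))]

    def avg_link(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_sim, (x, y) = max(
            (avg_link(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best_sim < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical contexts in which one name refers to two different people.
CONTEXTS = [
    "George Miller directed the film Mad Max in Australia",
    "The film director George Miller returned to Australia for a new Mad Max film",
    "George Miller, a psychologist, studied short-term memory at Princeton",
    "Psychologist George Miller published research on memory and language at Princeton",
]

if __name__ == "__main__":
    print(cluster_contexts(CONTEXTS, target="George Miller"))
    # prints [[0, 1], [2, 3]]
```

Each returned cluster is a list of context indices that the sketch treats as referring to the same underlying identity. On real Web data the contexts are far noisier than these examples, which is why feature selection, context representation, and the choice of stopping rule matter in practice.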
