Paper Information - Unsupervised Discrimination of Person Names in Web Contexts

Ambiguous person names are a problem in many forms of written text, including that which is found on the Web. In this paper we explore the use of unsupervised clustering techniques to discriminate among entities named in Web pages. We examine three main issues via an extensive experimental study. First, the effect of using a held-out set of training data for feature selection versus using the data in which the ambiguous names occur. Second, the impact of using different measures of association for identifying lexical features. Third, the success of different cluster stopping measures that automatically determine the number of clusters in the data.

Ted Pedersen | Anagha Kulkarni
