Paper information - How Many Different "John Smiths", and Who Are They?

In this work we propose three unsupervised measures to automatically identify the number of distinct entities a given ambiguous name refers to in a corpus. We experiment with 22 artificially created name conflations and observe that the measure (PK2) formulated as the ratio of two successive clustering criterion function values outperforms the other two measures. We also describe a method to assign a unique label to each discovered cluster so as to identify the underlying entity that it refers to.
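The abstract describes PK2 only as the ratio of two successive clustering criterion function values. The following is a minimal sketch of that ratio, not the authors' implementation. The names `pk2_ratios` and `pick_k`, the `tolerance` parameter, and the stopping rule (stop once the ratio is close to 1) are all assumptions introduced for illustration; the abstract does not state how the ratio is turned into a final cluster count.

```python
from typing import Dict, Sequence


def pk2_ratios(crfun: Sequence[float]) -> Dict[int, float]:
    """Ratio of successive criterion function values.

    crfun[i] is assumed to hold the criterion function value obtained when
    clustering into k = i + 1 clusters, and is assumed to be positive.
    Returns {k: crfun(k) / crfun(k - 1)} for k >= 2.
    """
    return {k: crfun[k - 1] / crfun[k - 2] for k in range(2, len(crfun) + 1)}


def pick_k(crfun: Sequence[float], tolerance: float = 0.05) -> int:
    """Hypothetical stopping rule (an assumption, not taken from the paper).

    If moving from k - 1 to k clusters changes the criterion value by less
    than `tolerance` (ratio within `tolerance` of 1), treat k - 1 as the
    number of clusters. Falls back to the largest k that was tried.
    """
    for k, ratio in sorted(pk2_ratios(crfun).items()):
        if abs(ratio - 1.0) <= tolerance:
            return k - 1
    return len(crfun)


# Made-up criterion values for k = 1..5, purely for illustration.
example_crfun = [10.0, 16.0, 20.0, 20.4, 20.6]
ratios = pk2_ratios(example_crfun)
chosen_k = pick_k(example_crfun)
```

With these invented values the criterion function improves sharply up to three clusters and then flattens, so the sketch stops at three. A real run would take the criterion values from the clustering tool in use.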

Ted Pedersen | Anagha Kulkarni

[1] R. Mojena, et al. Hierarchical Grouping Methods and Stopping Rules: An Evaluation, 1977, Comput. J.

[2] Ted Pedersen, et al. Name Discrimination by Clustering Similar Contexts, 2005, CICLing.

[3] Robert Tibshirani, et al. Estimating the number of clusters in a data set via the gap statistic, 2000.

[4] John A. Hartigan, et al. Clustering Algorithms, 1975.

[5] Hinrich Schütze, et al. Automatic Word Sense Discrimination, 1998, Comput. Linguistics.

[6] Ted Pedersen, et al. Distinguishing Word Senses in Untagged Text, 1997, EMNLP.
