Clustering Technique in Multi-Document Personal Name Disambiguation

This paper develops an agglomerative clustering approach to multi-document personal name disambiguation. We start from an analysis of the point-wise mutual information between features and the ambiguous name, which leads to a novel method for computing feature weights in clustering. We then propose a stopping criterion for clustering based on a trade-off between within-cluster compactness and among-cluster separation. After that, we apply a labeling method to find representative features for each cluster. Finally, experiments on word-based clustering over a Chinese dataset indicate that the approach is effective.
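As a rough illustration of the quantity the weighting method starts from, the sketch below computes point-wise mutual information between a feature and an ambiguous name from document counts. It uses the standard definition PMI(f, n) = log2(P(f, n) / (P(f) P(n))); the function name `pmi`, its parameters, and the counts are hypothetical, and this is not the paper's actual weight formula, which the abstract does not give.

```python
import math


def pmi(n_joint, n_feature, n_name, n_docs):
    """Standard PMI between a feature and a name, estimated from document counts.

    n_joint   -- documents containing both the feature and the name
    n_feature -- documents containing the feature
    n_name    -- documents containing the name
    n_docs    -- total documents in the collection
    (All parameter names are illustrative assumptions.)
    """
    if n_joint == 0:
        # The feature never co-occurs with the name.
        return float("-inf")
    p_joint = n_joint / n_docs
    p_feature = n_feature / n_docs
    p_name = n_name / n_docs
    return math.log2(p_joint / (p_feature * p_name))


# Hypothetical counts: the feature appears in 40 of 1000 documents,
# the name in 100, and the two co-occur in 20.
associated = pmi(n_joint=20, n_feature=40, n_name=100, n_docs=1000)

# A feature that co-occurs with the name exactly as often as chance predicts.
independent = pmi(n_joint=4, n_feature=40, n_name=100, n_docs=1000)
```

With these made-up counts the first feature co-occurs with the name five times more often than independence would predict, so its PMI is positive, while the second scores approximately zero. A weighting scheme built on PMI would presumably favor the former.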
