Paper information - An Unsupervised Language Independent Method of Name Discrimination Using Second Order Co-occurrence Features

Previous work by Pedersen, Purandare and Kulkarni (2005) has resulted in an unsupervised method of name discrimination that represents the context in which an ambiguous name occurs using second order co-occurrence features. These contexts are then clustered in order to identify which are associated with different underlying named entities. The method also extracts descriptive and discriminating bigrams from each of the discovered clusters to serve as identifying labels. These methods have been shown to perform well with English text, although we believe them to be language independent since they rely on lexical features and use no syntactic features or external knowledge sources. In this paper we apply this methodology in exactly the same way to Bulgarian, English, Romanian, and Spanish corpora. We find that it attains discrimination accuracy that is consistently well above that of a majority classifier, thus providing support for the hypothesis that the method is language independent.
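To make the idea of second order co-occurrence features concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simplified setup: each word gets a vector of the words it co-occurs with in a small training corpus, and a context containing the ambiguous name is represented by the average of the vectors of its words. The toy corpus, the function names (`cooccurrence_vectors`, `second_order_vector`, `cosine`), and the use of raw counts with cosine similarity are all illustrative assumptions. The clustering and bigram labelling steps are omitted.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(contexts):
    """Map each word to counts of the words it shares a context with."""
    vectors = defaultdict(Counter)
    for tokens in contexts:
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def second_order_vector(tokens, vectors):
    """Average the co-occurrence vectors of the words in one context."""
    total = Counter()
    for word in tokens:
        total.update(vectors.get(word, Counter()))
    n = len(tokens) or 1
    return {word: count / n for word, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: tokenized contexts with the ambiguous name removed.
corpus = [
    ["striker", "scored", "goal"],
    ["goal", "match", "league"],
    ["senator", "voted", "bill"],
    ["bill", "congress", "debate"],
]
vectors = cooccurrence_vectors(corpus)

# Three new contexts; ctx_a and ctx_b share no words with each other.
ctx_a = second_order_vector(["striker", "scored"], vectors)
ctx_b = second_order_vector(["match", "league"], vectors)
ctx_c = second_order_vector(["senator", "voted"], vectors)

sim_ab = cosine(ctx_a, ctx_b)
sim_ac = cosine(ctx_a, ctx_c)
print(sim_ab > sim_ac)
```

In this sketch `ctx_a` and `ctx_b` have no words in common, yet they come out as similar because their words both co-occur with "goal" in the corpus. That indirect evidence is what lets a clustering step group contexts that refer to the same underlying entity, and since only token co-occurrence is used, nothing in the representation depends on a particular language.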

Ted Pedersen | Zornitsa Kozareva | Thamar Solorio | Anagha Kulkarni | Roxana Angheluta

[1] Preslav Nakov, et al. Category-based Pseudowords, 2003, HLT-NAACL.

[2] Vasileios Hatzivassiloglou, et al. Disambiguating proteins, genes, and RNA in text: a machine learning approach, 2001, ISMB.

[3] Hinrich Schütze, et al. Automatic Word Sense Discrimination, 1998, Comput. Linguistics.

[4] Ted Pedersen, et al. Name Discrimination by Clustering Similar Contexts, 2005, CICLing.

[5] Tanja Gaustad, et al. Statistical Corpus-Based Word Sense Disambiguation: Pseudowords vs. Real Ambiguous Words, 2001, ACL.

[6] Tapio Salakoski, et al. New Techniques for Disambiguation in Natural Language and Their Application to Biological Text, 2004, J. Mach. Learn. Res.

[7] Ted Pedersen, et al. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces, 2004, CoNLL.

[8] James Allan, et al. Cross-Document Coreference on a Large Scale Corpus, 2004, NAACL.

[9] David Yarowsky, et al. Unsupervised Personal Name Disambiguation, 2003, CoNLL.

[10] Breck Baldwin, et al. Entity-Based Cross-Document Coreferencing Using the Vector Space Model, 1998, COLING.

[11] Amruta Purandare. Discriminating Among Word Senses Using McQuitty's Similarity Analysis, 2003, HLT-NAACL.