Paper information - Integrating Seed Names and ngrams for a Named Entity List and Classifier

We present a method for building a named-entity list and machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named-entity recognizer, and labeled seed name lists taken from the internet. The seed names, labeled as PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in an 83-million-word corpus, and their immediate contexts are stored as instances of their label. These 8-grams are used by a decision-tree learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61% and a recall of 56%; when optimizing for precision, an overall precision of 83% can be obtained (with a top precision of 88% on PERSON). On free text, named-entity token labeling accuracy is 71%.
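The instance-building step described in the abstract (look up seed names in the corpus and store their immediate contexts under the seed's label) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `context_instances`, the padding symbol, the two-token window, the single-token seed matching, and the example sentence are all assumptions made for illustration, and the decision-tree learner that would consume the instances is not shown.

```python
from typing import Dict, List, Tuple

PAD = "_"  # hypothetical padding symbol for positions outside the text

Instance = Tuple[Tuple[str, ...], str]


def context_instances(
    tokens: List[str],
    seeds: Dict[str, str],
    left: int = 2,   # assumed window size, not taken from the paper
    right: int = 2,  # assumed window size, not taken from the paper
) -> List[Instance]:
    """Return (context window, label) pairs for every seed name in tokens.

    Only single-token seed names are matched in this sketch.
    """
    instances: List[Instance] = []
    for i, tok in enumerate(tokens):
        label = seeds.get(tok)
        if label is None:
            continue  # not a seed name, so no labeled instance
        before = [tokens[j] if j >= 0 else PAD
                  for j in range(i - left, i)]
        after = [tokens[j] if j < len(tokens) else PAD
                 for j in range(i + 1, i + 1 + right)]
        instances.append((tuple(before + [tok] + after), label))
    return instances


# Illustrative seed list and sentence (invented for this example).
seeds = {"Kok": "PERSON", "Amsterdam": "LOCATION"}
tokens = "Gisteren sprak Kok in Amsterdam over de begroting".split()
instances = context_instances(tokens, seeds)
for features, label in instances:
    print(label, features)
```

Each resulting tuple is a fixed-width feature vector with its seed label, which is the kind of input a decision-tree learner could be trained on.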

Antal van den Bosch | Sabine Buchholz
