Integrating Seed Names and n-grams for a Named Entity List and Classifier

We present a method for building a named-entity list and a machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named-entity recognizer, and labeled seed name lists taken from the internet. The seed names, each labeled as a PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in an 83-million-word corpus, and their immediate contexts are stored as instances of their label. These 8-grams are used by a decision-tree learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61% and a recall of 56%; when optimizing for precision, an overall precision of 83% can be obtained (with a top precision of 88% on PERSON). On free text, named-entity token labeling accuracy is 71%.
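The instance-collection step described above can be illustrated with a minimal sketch. The abstract does not say how the eight positions of each 8-gram are laid out, so the split used here (four tokens to the left, the name itself, three to the right) is an assumption, as are the padding symbol, the restriction to single-token names, the function name `extract_instances`, and the example sentence and seed entries, which are made up for illustration.

```python
from typing import Dict, List, Tuple

PAD = "_"  # hypothetical filler for window positions that fall outside the text

Instance = Tuple[Tuple[str, ...], str]


def extract_instances(
    tokens: List[str],
    seeds: Dict[str, str],
    left: int = 4,   # assumed number of left-context tokens
    right: int = 3,  # assumed number of right-context tokens
) -> List[Instance]:
    """Return one (window, label) pair per occurrence of a seed name.

    Each window has left + 1 + right tokens (8 with the defaults) and
    inherits the label of the seed name at its focus position.
    """
    instances: List[Instance] = []
    for i, tok in enumerate(tokens):
        label = seeds.get(tok)
        if label is None:
            continue  # not a seed name, so no instance is created
        # Left context, padded when the name is near the start of the text.
        before = [PAD] * max(0, left - i) + tokens[max(0, i - left):i]
        # Right context, padded when the name is near the end of the text.
        after = tokens[i + 1:i + 1 + right]
        after = after + [PAD] * (right - len(after))
        instances.append((tuple(before + [tok] + after), label))
    return instances


# Illustrative data only: a made-up sentence and two made-up seed entries.
tokens = "Gisteren sprak premier Kok in Amsterdam met de pers".split()
seeds = {"Kok": "PERSON", "Amsterdam": "LOCATION"}
instances = extract_instances(tokens, seeds)

for window, label in instances:
    print(label, " ".join(window))
```

Each resulting pair could then be handed to any supervised learner as one feature vector of eight symbolic features plus a class label; the decision-tree learner itself is not sketched here.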