Multilingual Named-Entity Recognition from Parallel Corpora

We present a named-entity recognition (NER) system for parallel multilingual text. Our system handles three languages (i.e., English, French, and Spanish) and is tailored to the biomedical domain. For each language, we design a supervised knowledge-based CRF model with rich biomedical and general domain information. We use the sentence alignment of the parallel corpora, the word alignment generated by the GIZA++[8] tool, and Wikipedia-based word alignment in order to transfer system predictions made by individual language models to the remaining parallel languages. We re-train each individual language system using the transferred predictions and generate a final enriched NER model for each language. The enriched system performs better than the initial system based on the predictions transferred from the other language systems. Each language model benefits from the external knowledge extracted from biomedical and general domain resources.