Paper information - Extending an Information Extraction tool set to Central and Eastern European languages


Abstract

In a highly multilingual and multi-cultural environment such as in the European Commission with soon over twenty official languages, there is an urgent need for text analysis tools that use minimal linguistic knowledge so that they can be adapted to many languages without much human effort. We are presenting two such Information Extraction tools that have already been adapted to various Western and Eastern European languages: one for the recognition of date expressions in text, and one for the detection of geographical place names and the visualisation of the results in geographical maps. An evaluation of the performance has produced very satisfying results.

1 Introduction

The international staff of the European Commission (EC), like any other multinational organisation, has to deal with documents written in many different languages. Multilingual text analysis tools can help them to be more efficient and to get access to information written in documents they may not understand. However, not many commercial text analysis tools exist that can analyse texts in all official European Union (EU) languages, and we do not know of any tool that covers all of the over 20 languages that will be used after the planned Enlargement of the EU. The
