Building Support Tools for Russian-Language Information Extraction

Publicly available NLP tools for analyzing Russian-language text are currently scarce, especially for higher-level applications such as Information Extraction. We present work on tools for information extraction from Russian-language text in the domain of on-line news. At the lower level, we employ the AOT toolkit for natural language processing, which provides modules for morphological analysis and partial syntactic chunking. Since the outputs of both lower-level modules contain unresolved ambiguity, we synthesize the outputs and pass the result into a pre-existing English-language analysis pipeline. We describe how the information extraction system is adapted for multilingual support, including extensions to the ontologies and to the pattern matching mechanism. While this is work in progress, we present an end-to-end pipeline for event extraction from Russian-language news.
