Knowledge-Driven Event Extraction in Russian: Corpus-Based Linguistic Resources

Automatic event extraction from text is an important step in knowledge acquisition and knowledge base population. Manual work is indispensable in developing an extraction system, either for corpus annotation or for creating vocabularies and patterns for a knowledge-based system. Recent work has focused on adapting existing systems (for extraction from English texts) to new domains. Event extraction in other languages has received little attention, owing to the lack of resources and algorithms necessary for natural language processing. In this paper we define a set of linguistic resources necessary for developing a knowledge-based event extraction system for Russian: a vocabulary of subordination models, a vocabulary of event triggers, and a vocabulary of Frame Elements that are basic building blocks for semantic patterns. We propose a set of methods for creating such vocabularies in Russian and other languages using the Google Books NGram Corpus. The methods are evaluated in the development of an event extraction system for Russian.
