Slavic Information Extraction and Partial Parsing

Information Extraction (IE) often involves some amount of partial syntactic processing. This is clear for high-level IE tasks, such as finding out who did what to whom (when, where, how and why), but it also holds for simpler IE tasks, such as finding company names in texts. The aim of this paper is to give an overview of Slavonic phenomena which pose particular problems for IE and partial parsing, as well as some phenomena which seem easier to treat in Slavonic than in Germanic or Romance languages; I also mention various tools which have been used for the partial processing of Slavonic languages.
