Tools and methodologies for annotating syntax and named entities in the National Corpus of Polish

The ongoing project to create the National Corpus of Polish involves several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: syntactic words, syntactic groups, and named entities. We show how the knowledge-based platforms Spejd and SProUT are used for automatic pre-annotation of the corpus, and we discuss some particular problems encountered while developing the syntactic grammar, which contains over 800 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customized for manual post-editing of annotations and for the subsequent revision of discrepancies. Our XML format converters and customized archiving repository ensure automatic data flow and efficient corpus file management. We believe that this environment, or substantial parts of it, can be reused in or adapted for other corpus annotation tasks.
