Recent Developments in the National Corpus of Polish

This paper presents recent developments, as of March 2010, in the construction of the National Corpus of Polish (NKJP). The NKJP project was launched at the very end of 2007 and aims to compile a large, linguistically annotated corpus of contemporary Polish by the end of 2010. From the total pool of 1 billion words of text collected in the project, a 300-million-word balanced corpus will be selected to match a set of predefined representativeness criteria. The paper outlines a number of recent developments in the NKJP project, including: 1) the design of XML text encoding schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, and 3) numerous improvements in search tools. As work on NKJP progresses, it is becoming clear that the project serves as an important testbed for linguistic annotation and interoperability standards. We believe that our recent experiences will prove relevant to future large-scale language resource compilation efforts.
