Increasing Quality of the Corpus of Frequency Dictionary of Contemporary Polish for Morphosyntactic Tagging of the Polish Language

This paper addresses the correction of errors and ambiguities in the corpus of the Frequency Dictionary of Contemporary Polish (FDCP) and the application of the corrected corpus to morphosyntactic tagging of Polish. It presents several stages of corpus transformation and evaluates baseline part-of-speech tagging algorithms.
