Do LSTMs really work so well for PoS tagging? – A replication study

A recent study by Plank et al. (2016) found that LSTM-based PoS taggers considerably improve over the current state of the art when evaluated on the corpora of the Universal Dependencies project that use a coarse-grained tagset. We replicate this study using a fresh collection of 27 corpora covering 21 languages, annotated with fine-grained tagsets of varying size. Our replication confirms the result in general, and we additionally find that the advantage of LSTMs is even greater for larger tagsets. However, we also find that for the very large tagsets of morphologically rich languages, hand-crafted morphological lexicons are still necessary to reach state-of-the-art performance.

[1] Nikola Ljubesic, et al. The SETimes.HR Linguistically Annotated Corpus of Croatian, 2014, LREC.

[2] Cristina Bosco, et al. The Parallel-TUT: a multilingual and multiformat treebank, 2012, LREC.

[3] Sandra M. Aluísio, et al. An Account of the Challenge of Tagging a Reference Corpus for Brazilian Portuguese, 2003, PROPOR.

[4] Lilja Øvrelid, et al. The Norwegian Dependency Treebank, 2014, LREC.

[5] Barbara Plank, et al. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss, 2016, ACL.

[6] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[7] Éric Villemonte de la Clergerie, et al. Deep Syntax Annotation of the Sequoia French Treebank, 2014, LREC.

[8] Nikola Ljubesic, et al. New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian, 2016, LREC.

[9] Patrick Paroubek. Language Resources as by-Product of Evaluation: The MULTITAG Example, 2000, LREC.

[10] Brendan T. O'Connor, et al. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters, 2013, NAACL.

[11] Simonetta Montemagni, et al. A Resource and Tool for Super-sense Tagging of Italian Texts, 2010, LREC.

[12] D. Hladek, et al. Dagger: The Slovak morphological classifier, 2012, Proceedings ELMAR-2012.

[13] Liesbeth Augustinus, et al. AfriBooms: An Online Treebank for Afrikaans, 2016, LREC.

[14] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[15] Stefan Evert, et al. Is Part-of-Speech Tagging a Solved Task? An Evaluation of POS Taggers for the German Web as Corpus, 2009.

[16] Svetlana Alexeeva, et al. Crowdsourcing morphological annotation, 2013.

[17] Tomas Mikolov, et al. Enriching Word Vectors with Subword Information, 2016, TACL.

[18] Robert Östling, et al. Stagger: an Open-Source Part of Speech Tagger for Swedish, 2013.

[19] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[20] Montserrat Marimon, et al. The IULA Spanish LSP Treebank, 2014, LREC.

[21] Wolfgang Menzel, et al. Because Size Does Matter: The Hamburg Dependency Treebank, 2014, LREC.

[22] Sigrún Helgadóttir, et al. The Tagged Icelandic Corpus (MÍM), 2012.

[23] Alon Itai, et al. Language resources for Hebrew, 2008, Lang. Resour. Evaluation.

[24] Torsten Zesch, et al. LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text, 2016, WAC@ACL.

[25] Atro Voutilainen. FinnTreeBank: Creating a research resource and service for language researchers with Constraint Grammar, 2011.

[26] Matthias Buch-Kromann, et al. The Unified Annotation of Syntax and Discourse in the Copenhagen Dependency Treebanks, 2010, Linguistic Annotation Workshop.

[27] Christian Biemann, et al. EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres, 2016, WAC@ACL.

[28] Leon Derczynski, et al. Tune Your Brown Clustering, Please, 2015, RANLP.

[29] János Csirik, et al. The Szeged Treebank, 2005, TSD.

[30] Adam Przepiórkowski, et al. Towards the National Corpus of Polish, 2008, LREC.

[31] Wang Ling, et al. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation, 2015, EMNLP.

[32] Tomaž Erjavec, et al. MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora, 2010, LREC.

[33] Gertjan van Noord, et al. Alpino: Wide-coverage Computational Analysis of Dutch, 2000, CLIN.

[34] Christian Biemann, et al. Corpus Portal for Search in Monolingual Corpora, 2006, LREC.

[35] Jan Hajic, et al. Semi-Supervised Training for the Averaged Perceptron POS Tagger, 2009, EACL.

[36] Mojgan Seraji, et al. A Statistical Part-of-Speech Tagger for Persian, 2011, NODALIDA.

[37] Kevin Duh, et al. DyNet: The Dynamic Neural Network Toolkit, 2017, ArXiv.

[38] Regina Barzilay, et al. Multilingual part-of-speech tagging, 2009.

[39] András Kornai, et al. HunPos: an open source trigram tagger, 2007, ACL.

[40] Wolfgang Lezius, et al. TIGER: Linguistic Interpretation of a German Corpus, 2004.

[41] Tomaž Erjavec. Compiling and Using the IJS-ELAN Parallel Corpus, 2002, Informatica.

[42] Erhard W. Hinrichs, et al. The Tüba-D/Z Treebank: Annotating German with a Context-Free Backbone, 2004, LREC.