Fine-Grained Morpho-Syntactic Analysis for the Under-Resourced Language Chaghatay

We investigate part-of-speech (POS) tagging for Chaghatay, a historical language with a considerable amount of morphology but few available resources such as POS-annotated corpora. When there is little training data but a large POS tagset, it is not obvious which method will yield the most accurate POS tagger. We experiment with a conditional random field (CRF) and a recurrent neural network (RNN), augmenting the models with coarse-grained POS tag information and with additional data: either unannotated data used to train a language model, or annotated data from a modern relative, Uyghur. Our results show that the combination of an RNN and pretraining with coarse-grained POS tags reaches the highest accuracy, 76.17%.
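
To make the relation between fine-grained and coarse-grained tags concrete, the following is a minimal sketch, not the tagset or code used in this work. It assumes a hypothetical tag format in which the coarse category is the first dot-separated field (for example `N.PL.ACC`); the tag strings, the helper names `coarse_tag` and `accuracy`, and the toy sequences are all invented for illustration.

```python
def coarse_tag(fine_tag: str, sep: str = ".") -> str:
    """Collapse a fine-grained tag to its coarse category.

    Assumption: tags look like 'COARSE.FEAT.FEAT' (hypothetical format).
    """
    return fine_tag.split(sep, 1)[0]


def accuracy(gold: list[str], pred: list[str]) -> float:
    """Token-level accuracy: fraction of positions where the tags match."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Invented example tags, not taken from any Chaghatay corpus.
fine_gold = ["N.PL.ACC", "V.PST.3SG", "ADJ", "N.SG.GEN"]
fine_pred = ["N.PL.ACC", "V.PST.1SG", "ADJ", "N.SG.GEN"]

# Coarse labels of this kind could serve as a pretraining target
# or as an extra input feature for a tagger.
coarse_gold = [coarse_tag(t) for t in fine_gold]
coarse_pred = [coarse_tag(t) for t in fine_pred]

fine_acc = accuracy(fine_gold, fine_pred)
coarse_acc = accuracy(coarse_gold, coarse_pred)
```

In this toy case the single error is a wrong person feature on a verb, so it counts against fine-grained accuracy but disappears once tags are collapsed. This illustrates why the coarse tagging task is easier to learn from little data.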
