Paper Information - Fine-Grained Morpho-Syntactic Analysis for the Under-Resourced Language Chaghatay

We investigate part-of-speech (POS) tagging for Chaghatay, a historical language with considerable morphology but few available resources such as POS-annotated corpora. With little training data but a large POS tagset, it is not obvious which method will yield the most accurate POS tagger. We experiment with a conditional random field (CRF) and a recurrent neural network (RNN), augmenting the models with coarse-grained POS tag information and with additional data: either unannotated data used to train a language model, or annotated data from a modern relative, Uyghur. Our results show that combining an RNN with pretraining on coarse-grained POS tags reaches the highest accuracy, 76.17%.
