Paper Information - Data-driven Choices in Neural Part-of-Speech Tagging for Latin

Data-driven Choices in Neural Part-of-Speech Tagging for Latin

Textual data in ancient and historical languages such as Latin is increasingly available in machine-readable forms, yet computational tools to analyze and process this data are still lacking. We describe our system for part-of-speech tagging in Latin, an entry in the EvaLatin 2020 shared task. Based on a detailed analysis of the training data, we make targeted preprocessing decisions and design our model. We leverage existing large unlabelled resources to pre-train representations at both the grapheme and word level, which serve as the inputs to our LSTM-based models. We perform an extensive cross-validated hyperparameter search, achieving an accuracy score of up to 93 on in-domain texts. We publicly release all our code and trained models in the hope that our system will be of use to social scientists and digital humanists alike. The insights we draw from our initial analysis can also inform future NLP work modeling syntactic information in Latin.

Geoff Bacon

[1] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[2] Marco Passarotti, et al. Overview of the EvaLatin 2020 Evaluation Campaign, 2020, LT4HALA.

[3] Steven Moran, et al. The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles, 2017.

[4] Tomas Mikolov, et al. Enriching Word Vectors with Subword Information, 2016, TACL.