Paper Information - Increasing the Quality and Quantity of Source Language Data for Unsupervised Cross-Lingual POS Tagging

Bilingual corpora offer a promising bridge between resource-rich and resource-poor languages, enabling the development of natural language processing systems for the latter. English is often selected as the resource-rich language, but another choice might give better performance. In this paper, we consider the task of unsupervised cross-lingual POS tagging, and construct a model that predicts the best source language for a given target language. In experiments on 9 languages, this model improves on using a single fixed source language. We then show that further improvements can be made by combining information from multiple source languages.
