Part-of-speech Taggers for Low-resource Languages using CCA Features

In this paper, we address the challenge of building accurate and robust part-of-speech taggers for low-resource languages. We propose a method that leverages existing parallel data between the target language and a large set of resource-rich languages, without relying on ancillary resources such as tag dictionaries. Crucially, we use canonical correlation analysis (CCA) to induce latent word representations that incorporate cross-genre distributional cues as well as projected tags from a full array of resource-rich languages. We develop a probability-based confidence model to identify words with highly likely tag projections, and we use these words to train a multi-class SVM on the CCA features. Our method achieves an average accuracy of 85% on languages with almost no resources, outperforming a state-of-the-art partially observed CRF model.
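The abstract does not spell out the confidence model, so the following is only a minimal sketch of one way the selection step could work, assuming confidence is estimated as the relative frequency of the most common tag projected onto a word. The function name `confident_tags`, the input format, the `threshold` and `min_count` parameters, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def confident_tags(projected, threshold=0.8, min_count=3):
    """Return {word: tag} for words whose projected tags agree strongly.

    projected: dict mapping a word type to the list of tags projected onto
    its tokens through word alignments (an assumed input format).
    threshold, min_count: illustrative cut-offs, not values from the paper.
    """
    selected = {}
    for word, tags in projected.items():
        if len(tags) < min_count:
            continue  # too few projections to estimate a distribution
        tag, count = Counter(tags).most_common(1)[0]
        # Keep the word only if its most frequent tag dominates.
        if count / len(tags) >= threshold:
            selected[word] = tag
    return selected


# Toy projections onto three target-language word types.
projected = {
    "haus": ["NOUN"] * 9 + ["VERB"],          # 90% NOUN: kept
    "lauf": ["VERB", "NOUN", "VERB", "NOUN"],  # 50/50: dropped
    "und": ["CONJ", "CONJ"],                   # too few projections: dropped
}
result = confident_tags(projected)
print(result)
```

Under these assumptions, only the selected words and their tags would be passed on as training examples for the multi-class SVM, with each word represented by its CCA features.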
