Climbing the Tower of Babel: Unsupervised Multilingual Learning

For centuries, scholars have explored the deep links among human languages. In this paper, we present a class of probabilistic models that use these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for the computational decipherment of lost languages.
