Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data

We present a new method for unsupervised learning of multilingual symbol (e.g. character) embeddings that requires neither parallel data nor prior knowledge of correspondences between languages. The method exploits cross-lingual similarities in the distributions of contexts in which symbols occur within their own language, even when the two languages have no symbols in common. In experiments with an artificially corrupted text corpus, we show that the method can retrieve character correspondences obscured by noise. We then present encouraging results from applying the method to real linguistic data, including low-resource languages. The learned representations open the possibility of fully unsupervised comparative studies of text or speech corpora in low-resource languages, with no prior knowledge of their symbol sets.
