Finding translations for low-frequency words in comparable corpora

Statistical methods for extracting translational equivalents from non-parallel corpora promise to provide the coverage and domain customisation that lexicons require, and to speed up their compilation and maintenance. Rare words and expressions, which have low corpus frequencies, are a challenge for these methods. Yet it is precisely rare words, such as newly introduced terminology and named entities, that are of most interest for practical lexical acquisition. In this article, we study ways of improving the extraction of low-frequency equivalents from bilingual comparable corpora. Our work follows the general framework that discovers equivalences between words of different languages from similarities between their occurrence patterns in the respective monolingual corpora. We develop a method that aims to compensate for the insufficient corpus evidence on rare words: before measuring cross-language similarities, it uses same-language corpus data to model the co-occurrence vectors of rare words, predicting their unseen co-occurrences and smoothing rare, unreliable ones. Our experimental evaluation shows that the proposed method delivers a consistent and significant improvement over the conventional approach to this task.
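
To make the idea of modelling a rare word's co-occurrence vector from same-language data more concrete, the following is a minimal sketch and not the method evaluated in the article. It assumes a generic nearest-neighbour scheme: the rare word's sparse vector is interpolated with a similarity-weighted average of the vectors of its most similar same-language words. The function names, the cosine measure, the neighbourhood size `k`, the interpolation weight `lam`, and the toy vectors are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(ctx, 0.0) for ctx, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth_vector(word, vectors, k=2, lam=0.5):
    """Interpolate a word's vector with a neighbour-based prediction.

    `k` (number of neighbours) and `lam` (weight of the observed vector)
    are hypothetical parameters chosen for illustration only.
    """
    target = vectors[word]
    # Rank all other same-language words by similarity to the target.
    neighbours = sorted(
        ((cosine(target, vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return dict(target)  # no usable neighbours: leave the vector as is
    # Predicted vector: similarity-weighted average of neighbour vectors.
    predicted = {}
    for sim, other in neighbours:
        for ctx, weight in vectors[other].items():
            predicted[ctx] = predicted.get(ctx, 0.0) + (sim / total) * weight
    # Interpolation adds unseen contexts and dampens unreliable seen ones.
    contexts = set(target) | set(predicted)
    return {
        ctx: lam * target.get(ctx, 0.0) + (1 - lam) * predicted.get(ctx, 0.0)
        for ctx in contexts
    }


# Toy relative-frequency vectors (invented data); "angioplasty" is the rare word.
vectors = {
    "angioplasty": {"patient": 1.0},
    "surgery": {"patient": 0.5, "hospital": 0.3, "perform": 0.2},
    "operation": {"patient": 0.4, "hospital": 0.4, "perform": 0.2},
    "banana": {"eat": 0.7, "fruit": 0.3},
}
smoothed = smooth_vector("angioplasty", vectors)
```

In this toy setting the smoothed vector for "angioplasty" gains non-zero weight for "hospital" and "perform", contexts it was never observed with, while the weight of its single observed context is reduced. A vector smoothed in this way would then feed into the cross-language similarity step in place of the raw one.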
