Enrichment of Bilingual Dictionary through News Stream Data

Bilingual dictionaries are a key component of cross-lingual similarity estimation methods. Such dictionaries are usually built either manually or automatically. Automatic approaches exploit parallel or comparable data to derive dictionary entries, and they require a large amount of bilingual data to produce a good-quality dictionary. Many language pairs lack large bilingual comparable corpora, and in such cases the quality and coverage of the available corpora place an upper bound on the best automatically generated dictionary. In this work we propose a method that exploits continuous quasi-comparable corpora to derive term-level associations for enriching such a limited dictionary. Although our experiments cover English and Hindi, the approach can easily be extended to other languages. We evaluate the enriched dictionary by manually computing its precision. Our experiments show that the approach is able to derive interesting term-level associations across languages.
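To make the idea of term-level associations and precision-based evaluation concrete, the following is a minimal sketch and not the method of this work. The abstract does not name an association measure, so the Dice coefficient over aligned document pairs is an assumption; the function names `dice_associations` and `precision`, the threshold, and the transliterated toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def dice_associations(doc_pairs, min_score=0.5):
    """Score cross-lingual term pairs with the Dice coefficient.

    doc_pairs: iterable of (source_terms, target_terms) for document pairs
    that are ASSUMED to be already aligned (e.g. news on the same event).
    """
    src_df, tgt_df, pair_df = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in doc_pairs:
        src_set, tgt_set = set(src_terms), set(tgt_terms)
        src_df.update(src_set)                      # pairs containing each source term
        tgt_df.update(tgt_set)                      # pairs containing each target term
        pair_df.update(product(src_set, tgt_set))   # pairs containing both terms

    scores = {}
    for (s, t), joint in pair_df.items():
        score = 2.0 * joint / (src_df[s] + tgt_df[t])
        if score >= min_score:
            scores[(s, t)] = score
    return scores


def precision(candidates, judged_correct):
    """Fraction of candidate entries that a human judge marked correct."""
    candidates = set(candidates)
    if not candidates:
        return 0.0
    return len(candidates & set(judged_correct)) / len(candidates)


# Toy, invented data: English terms paired with transliterated Hindi terms.
doc_pairs = [
    (["rain", "city"], ["baarish", "shahar"]),
    (["rain", "flood"], ["baarish", "baadh"]),
    (["city", "election"], ["shahar", "chunaav"]),
]

candidates = dice_associations(doc_pairs, min_score=0.9)

# Hypothetical manual judgments: one candidate is left unmarked on purpose.
judged_correct = {("rain", "baarish"), ("city", "shahar"), ("flood", "baadh")}
print(sorted(candidates))
print(precision(candidates, judged_correct))
```

In this sketch a high threshold keeps only term pairs that always co-occur in the aligned documents; a real news stream would need a much larger sample and a measure that is robust to frequent terms.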
