Generation of Bilingual Dictionaries using Structural Properties

Building bilingual dictionaries from Wikipedia has been extensively studied in computational linguistics. These dictionaries play a crucial role in Natural Language Processing (NLP) applications such as Cross-Lingual Information Retrieval, Machine Translation and Named Entity Recognition. Most existing approaches build them from the information in Wikipedia titles, infoboxes and categories; interestingly, few exploit the structural properties of a document, such as its sections and subsections. In this work we exploit these structural properties to build a bilingual English-Hindi dictionary. The main intuition behind this approach is that documents in different languages discussing the same topic are likely to have similar structural elements. Although we present experiments only for Hindi, our approach is language independent and can easily be extended to other languages. The major contribution of our work is that the dictionary contains translations and transliterations of words, including named entities to a large extent. We evaluate the dictionary using manually computed precision: our approach generated a list of 72k tokens with a precision of 0.75.
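To make the intuition concrete, the sketch below pairs the section headings of an English article with those of its Hindi counterpart, reached through Wikipedia's interlanguage links, to obtain aligned structural anchors from which candidate translation pairs could be mined. This is a minimal illustration using the public MediaWiki API, assuming a naive positional pairing of headings and an arbitrary example title; it is not the exact pipeline used in the paper.

```python
# Minimal sketch of the structural-alignment intuition (illustrative only,
# not the authors' pipeline): pair the section headings of an English
# Wikipedia article with those of its Hindi counterpart.
import requests

API = "https://{lang}.wikipedia.org/w/api.php"

def hindi_title(en_title):
    """Follow the English article's interlanguage link to Hindi, if any."""
    r = requests.get(API.format(lang="en"), params={
        "action": "query", "titles": en_title, "prop": "langlinks",
        "lllang": "hi", "format": "json", "formatversion": "2"}).json()
    page = r["query"]["pages"][0]
    links = page.get("langlinks", [])
    return links[0]["title"] if links else None

def section_headings(lang, title):
    """Return the ordered list of section headings of an article."""
    r = requests.get(API.format(lang=lang), params={
        "action": "parse", "page": title, "prop": "sections",
        "format": "json", "formatversion": "2"}).json()
    return [s["line"] for s in r["parse"]["sections"]]

def candidate_pairs(en_title):
    """Naively pair the i-th English heading with the i-th Hindi heading."""
    hi = hindi_title(en_title)
    if hi is None:
        return []
    return list(zip(section_headings("en", en_title),
                    section_headings("hi", hi)))

if __name__ == "__main__":
    # "Taj Mahal" is an arbitrary example title, not one from the paper.
    for en, hi in candidate_pairs("Taj Mahal"):
        print(en, "->", hi)
```

In a full system, such heading pairs would still need to be filtered and decomposed into word-level translation and transliteration candidates before entering the dictionary; the positional pairing above is only the simplest possible alignment of structural elements.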
