Language-Independent Context Aware Query Translation using Wikipedia

Cross-lingual information access (CLIA) systems are needed to access the large amounts of multilingual content generated on the World Wide Web in the form of blogs, news articles and documents. In this paper, we discuss our approach to query formation for CLIA systems in which language resources are replaced by Wikipedia. We claim that Wikipedia, with its rich multilingual content and structure, forms an ideal platform on which to build a CLIA system. Our approach is particularly useful for under-resourced languages, since not all languages have resources (tools) of sufficient accuracy. We propose a context-aware, language-independent query formation method which, with the help of bilingual dictionaries, forms queries in the target language. The results are encouraging, with a precision of 69.75%, and thus support our claim that Wikipedia can be used to build CLIA systems.
