Semantic morphological variant selection and translation disambiguation for cross-lingual information retrieval

Cross-Lingual Information Retrieval (CLIR) enables a user to query in a language different from that of the target documents. CLIR relies on a translation technique based on either a manual dictionary or a probabilistic dictionary generated from a parallel corpus. Translation techniques for Hindi suffer from a translation mis-mapping issue caused by the morphological richness of the language. In addition, a word may have multiple translations in a dictionary, which leads to a word translation disambiguation issue. This paper presents two key contributions, Semantic Morphological Variant Selection (SMVS) and Hybrid Word Translation Disambiguation (HWTD): the former resolves the translation mis-mapping issue, and the latter disambiguates the queries more effectively. The proposed techniques are evaluated on the FIRE ad-hoc datasets, where SMVS and HWTD at the word level achieve better evaluation measures than the baseline Statistical Machine Translation.
