XRCE's Participation to CLEF 2007 Domain-Specific Track

Our participation in the CLEF 2007 Domain-Specific Track was motivated by the wish to assess several query translation and expansion strategies that we recently designed and developed. One line of research and development was to use our own Statistical Machine Translation system (called Matrax) and its intermediate outputs to perform query translation and disambiguation. Our idea was to benefit from Matrax's ability to output more than one plausible translation and to train its Language Model component on the CLEF 2007 target corpora. The second line of research consisted in designing algorithms to adapt an initial, general probabilistic dictionary to a particular (query, target corpus) pair; this is an extreme form of the “bilingual lexicon extraction and adaptation” problem that we have been investigating for more than six years. For this strategy, our main contributions are a pseudo-feedback algorithm and an EM-like optimisation algorithm that perform this adaptation. A third line was to evaluate the potential impact of “Lexical Entailment” models in a cross-lingual framework, as they had so far been used only in a monolingual setting. Experimental results on the CLEF 2007 corpora (Domain-Specific Track) show that the dictionary adaptation mechanisms are quite effective in the CLIR framework, exceeding in certain cases the performance of much more complex Machine Translation systems and even the performance of the monolingual baseline. In most cases, Lexical Entailment models, used as query expansion mechanisms, also turned out to be beneficial.
