Combining Categorization-based and Corpus-based Approaches for CLIR

Whether existing concept taxonomies can help cross-lingual information retrieval (CLIR) is an open question. This paper investigates an interlingual approach that uses MeSH categories in the medical domain to index bilingual documents and queries and to measure their relevance by category-level matching. We conducted bilingual retrieval experiments on a new corpus (Springer) of medical documents and queries in English and German. We also evaluated several high-performing corpus-based learning methods and a machine translation (MT) based approach using SYSTRAN, a commercial system with strong results on CLIR benchmarks. Our results on Springer show that the categorization-based approach significantly outperformed the MT-based approach but underperformed the corpus-based methods, owing to the loss of detailed information in category-level indexing. Combining the output of categorization-based retrieval and corpus-based retrieval yielded a significant performance improvement over using either alone.
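The abstract does not say how category-level matching was scored or how the two retrieval outputs were combined, so the following is only a minimal sketch under stated assumptions: queries and documents are represented as weighted category vectors compared by cosine similarity, and the two systems' scores are min-max normalized and mixed with a weight `alpha`. The function names, the `alpha` parameter, the category labels, and all numbers are hypothetical illustrations, not values from the paper.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def combine(cat_scores, corpus_scores, alpha=0.5):
    """Linear mix of two normalized score lists (assumed fusion rule).

    alpha weights the categorization-based scores; (1 - alpha) weights
    the corpus-based scores. A document missing from one list gets 0 there.
    """
    cat_n, cor_n = min_max(cat_scores), min_max(corpus_scores)
    docs = set(cat_n) | set(cor_n)
    return {
        d: alpha * cat_n.get(d, 0.0) + (1 - alpha) * cor_n.get(d, 0.0)
        for d in docs
    }


# Hypothetical category vectors: the query and each document are indexed
# by category weights rather than by words, so the language of the
# original text no longer matters at matching time.
query = {"Neoplasms": 0.8, "Lung Diseases": 0.6}
documents = {
    "d1": {"Neoplasms": 0.7, "Lung Diseases": 0.7},
    "d2": {"Cardiovascular Diseases": 1.0},
    "d3": {"Neoplasms": 1.0},
}
cat_scores = {d: cosine(query, vec) for d, vec in documents.items()}

# Hypothetical scores from a corpus-based retrieval method.
corpus_scores = {"d1": 2.0, "d2": 4.0, "d3": 3.0}

fused = combine(cat_scores, corpus_scores, alpha=0.25)
ranking = sorted(fused, key=fused.get, reverse=True)
```

In this sketch a document that shares no category with the query (here `d2`) scores zero on the category side regardless of its wording, which illustrates the loss of detail the abstract attributes to category-level indexing; mixing in the corpus-based scores lets word-level evidence compensate.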