Preliminary study into query translation for patent retrieval

Patent retrieval is a branch of Information Retrieval (IR) that aims to support patent professionals in retrieving patents that satisfy their information needs. Patent granting bodies often require patents to be partially translated into one or more major foreign languages, so that language boundaries do not hinder their accessibility. This multilinguality of patent collections offers opportunities for improving patent retrieval. In this work we exploit these opportunities by applying query translation to patent retrieval. We expand monolingual patent queries with their translations, using both a domain-specific patent dictionary that we extract from the patent collection and a general, domain-independent dictionary. Experimental evaluation on a standard CLEF-IP dataset shows that the two translation dictionaries yield similar results: query translation can help patent retrieval, but not always, and without great improvement over standard statistical monolingual query expansion (Rocchio). The improvement is greater when the source language is English rather than French or German, a finding due partly to the effect of complex French and German morphology on translation accuracy and partly to the prevalence of English in the collection. A thorough per-query analysis reveals that cases where standard query expansion fails (e.g. zero recall) can benefit from query translation.
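The idea of expanding a monolingual query with dictionary translations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `expand_query`, the dictionary layout (source term mapped to translation and probability pairs), the down-weighting factor `weight`, the cutoff `top_k`, and the toy entries are hypothetical and are not taken from the study.

```python
from collections import Counter


def expand_query(query_terms, dictionary, weight=0.5, top_k=2):
    """Return a weighted bag of terms: the original terms plus translations.

    `dictionary` maps a source term to a list of (translation, probability)
    pairs. The layout and the weighting scheme are illustrative assumptions.
    """
    expanded = Counter()
    for term in query_terms:
        # Original query terms keep full weight.
        expanded[term] += 1.0
        # Keep only the top_k most probable translations of each term.
        candidates = sorted(
            dictionary.get(term, []), key=lambda pair: pair[1], reverse=True
        )
        for translation, prob in candidates[:top_k]:
            # Translations are down-weighted relative to the original terms.
            expanded[translation] += weight * prob
    return dict(expanded)


# Toy dictionary with made-up probabilities (hypothetical data).
toy_dictionary = {
    "valve": [("Ventil", 0.8), ("soupape", 0.6), ("Klappe", 0.1)],
    "pump": [("Pumpe", 0.9)],
}

expanded = expand_query(["valve", "pump", "housing"], toy_dictionary)
```

A term with no dictionary entry, such as "housing" here, simply passes through unexpanded, so the expanded query is never smaller than the original one.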
