Wikipedia-based query phrase expansion in patent class search

Relevance feedback methods generally suffer from topic drift caused by word ambiguity and by the use of different words for the same concept. Topic drift is an important issue in patent information retrieval because people tend to describe similar concepts with different expressions, which lowers precision and recall at the same time. Furthermore, failing to retrieve patents relevant to an application during examination may lead to legal problems, since a patent could be granted for an invention that already exists. Relevance feedback-based search is itself a possible cause of topic drift. To alleviate this inherent problem, we propose a novel query phrase expansion approach that uses semantic annotations in Wikipedia pages to enrich queries with phrases that disambiguate the original query words. We implemented the idea for patent search, where patents are classified into a hierarchy of categories. Analyses of the experimental results showed that the expanded phrases and words not only help retrieve additional relevant documents through query expansion but also help alleviate the query drift problem. More specifically, we compared our query expansion method against the relevance-based language model, a state-of-the-art query expansion method, and it achieved higher MAP at all levels of the classification hierarchy.
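
Since the comparison is reported in terms of MAP (mean average precision), a minimal sketch of how that measure is commonly computed may help. This is the standard textbook definition, not the evaluation code used in the experiments; the function names, the query identifiers, and the classification codes in the usage note are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    # Dividing by all relevant items penalizes relevant items never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(
        average_precision(ranked, judgments.get(query_id, set()))
        for query_id, ranked in runs.items()
    )
    return total / len(runs)
```

As a usage example under the same assumptions, each query could be a patent application and each ranked item a candidate class code: `mean_average_precision({"q1": ["A01B", "G06F", "H04L"]}, {"q1": {"A01B", "H04L"}})` averages the precision at ranks 1 and 3. Computing the measure separately for each level of the classification hierarchy would mean truncating the codes to that level before scoring, which is an assumption about the setup rather than something the abstract specifies.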
