Paper information - BiTeM site Report for TREC Chemistry 2010: Impact of Citations Feeback for Patent Prior Art Search and Chemical Compounds Expansion for Ad Hoc Retrieval

For the past two years, the TREC Chemical Track has aimed to evaluate participant systems in chemical patent searching. In 2010, it continued with the two tasks from 2009: Prior Art search (PA) and Technology Survey (TS). The BiTeM group participated in both tasks and obtained satisfactory results, relying on a broad panel of strategies that had been evaluated in similar past competitions. We draw three main conclusions from this campaign. First, for a baseline computed by Information Retrieval (IR) alone, the simplest models achieved the best results for both tasks, for example indexing only titles, abstracts, and claims, with no stemming; however, for the PA task, the performance of this baseline remains low (Mean Average Precision, MAP, of 0.043) compared with last year (MAP 0.067). Further analysis of the query set reveals that description sections were twice as long in 2010 as in 2009, while the number of citations was stable; the longer queries degraded the signal-to-noise ratio and made the task harder for standard IR. Second, IPC codes were of no use for the PA task and even decreased performance, whether they were injected into the index or used to filter the results. Because this strategy is effective when applied to EPO patents in the general domain, further experiments or expertise are needed to determine whether it fails because it is applied to a specific domain or because the quality of IPC annotations in USPTO patents is insufficient. The last conclusion concerns our re-ranking strategy based on citation feedback for the PA task. This strategy led to a dramatic improvement in MAP from 0.043 to 0.261 (+507%) and in Recall at 500 from 0.31 to 0.62 (+100%). Further analysis shows that our citation feedback strategy strongly captures the behaviour of chemical patent applicants, who tend to cite regular patterns of multiple patents that are densely inter-connected by direct citations. Results of the TS task demonstrate the effectiveness of synonym expansion driven by chemical entity normalization.
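
The abstract does not spell out how the citation feedback re-ranking is computed. As a rough illustration of the general idea only, the sketch below boosts candidates that are linked by a direct citation to the top-ranked baseline results. The function name, the `top_k` and `weight` parameters, the additive scoring rule, and the toy patent identifiers are all assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def rerank_with_citation_feedback(baseline, citations, top_k=2, weight=1.0):
    """Re-rank a baseline result list using direct citation links.

    baseline:  list of (doc_id, score), sorted by descending score.
    citations: dict mapping doc_id -> set of doc_ids that it cites.
    top_k:     number of top baseline results used as feedback seeds (assumed).
    weight:    strength of the citation bonus (assumed).
    """
    scores = dict(baseline)
    bonus = defaultdict(float)

    # Each seed passes part of its score to every candidate it cites
    # or is cited by (hypothetical scoring rule, for illustration).
    for seed_id, seed_score in baseline[:top_k]:
        for other_id in scores:
            if other_id == seed_id:
                continue
            linked = (
                other_id in citations.get(seed_id, set())
                or seed_id in citations.get(other_id, set())
            )
            if linked:
                bonus[other_id] += weight * seed_score

    combined = {doc_id: score + bonus[doc_id] for doc_id, score in scores.items()}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy data: P4 is ranked last by the baseline but is cited by both top results.
baseline = [("P1", 0.9), ("P2", 0.8), ("P3", 0.5), ("P4", 0.4)]
citations = {"P1": {"P4"}, "P2": {"P4"}, "P3": set(), "P4": set()}
reranked = rerank_with_citation_feedback(baseline, citations)
ranked_ids = [doc_id for doc_id, _ in reranked]
```

In this toy run, `P4` moves from last to first place because both feedback seeds cite it. This mirrors the densely inter-connected citation patterns the abstract describes, although the actual BiTeM scoring may differ.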
