Prior Art Search in Chemistry Patents Based On Semantic Concepts and Co-Citation Analysis

Prior art search is the task of querying and retrieving patents in order to uncover any knowledge that existed prior to the inventor's question or invention at hand. To address this task, we present an approach that was evaluated in TREC-CHEM for its ability to adapt to text containing chemistry-based information. The core of the framework is an index of 1.3 million chemistry patents provided as a data set by TREC-CHEM. For the prior art search task, normalized noun phrases and biomedical and chemical entities are added to the full-text index. Altogether, seven runs were submitted for this task, all based on automatic querying with tokens, noun phrases and entities. In addition, co-citation information was exploited systematically to generate ranked citation sets from the retrieved documents. Querying with noun phrases and entities, coupled with co-citation-based post-processing, performed best, reaching a MAP score of 0.23.
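The abstract does not spell out how the ranked citation sets are built, so the following Python snippet is only a minimal sketch of one plausible citation-based post-processing step, not the authors' method. It assumes that each retrieved patent carries a retrieval score and a list of cited patents, and that a cited patent's score is the sum of the scores of the retrieved patents that cite it. The function name `rank_citations`, the input shapes and the summation rule are all hypothetical.

```python
from collections import defaultdict


def rank_citations(retrieved, citations, top_k=None):
    """Aggregate citations of retrieved patents into a ranked list.

    retrieved: list of (patent_id, retrieval_score) pairs.
    citations: dict mapping patent_id -> iterable of cited patent ids.
    Scoring rule (an assumption): a cited patent accumulates the
    retrieval scores of every retrieved patent that cites it.
    """
    scores = defaultdict(float)
    for patent_id, score in retrieved:
        for cited in citations.get(patent_id, ()):
            scores[cited] += score
    # Sort by descending score; break ties by id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top_k is None else ranked[:top_k]


# Toy data (invented for illustration only).
retrieved = [("P1", 3.0), ("P2", 2.0), ("P3", 1.0)]
citations = {"P1": ["A", "B"], "P2": ["B", "C"], "P3": ["C"]}
ranked = rank_citations(retrieved, citations)
print(ranked)
```

In this toy input, patent "B" is cited by the two highest-scoring retrieved patents and therefore ranks first. A real system would likely normalize scores or weight by rank, which this sketch leaves out.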
