Report on the TREC 2004 Experiment: Genomics Track

Summary. Corruptions in the XML TREC Genomics collection were detected only a few days before the submission deadline, so we were not able to submit runs for the ad hoc retrieval task (task I); the relevance judgements made after pooling were nevertheless used to evaluate our approaches. This report therefore focuses mostly on the text categorization task (task II: triage and annotation).

Task I. Our approach uses thesaural resources (from the UMLS) together with a variant of the Porter stemmer for string normalization. Gene and Protein Entities (GPE) in the collection were simply marked up by dictionary lookup during indexing in order to avoid erroneous conflation: strings not found in the UMLS Specialist lexicon (augmented with various English lexical resources) were considered to be GPE and were moderately overweighted. Two weighting schemes were tested: first, a standard tf-idf with cosine normalization; second, a weighting based on the divergence from randomness model. For indexing the Genomics collection, the following fields of the MEDLINE records were selected: article titles, MeSH and RN terms, and abstracts. We investigated high-precision strategies: our system returned only highly reliable documents, so some queries were left unanswered. Our best run achieved an average precision of 32% (ranked 6 out of 27 participants). This score was obtained using UMLS resources and GPE tagging together with a combination of a classical atc.ltn schema (following SMART notation) with a divergence from randomness [8] weighting: L(ne)C2 and KL for expansion.
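As an illustration of the tf-idf weighting named above, the following is a minimal sketch of an atc.ltn scoring scheme under the usual reading of SMART notation. It assumes that the first triple (atc) applies to documents and the second (ltn) to queries, that "a" is the augmented term frequency 0.5 + 0.5 * tf / max_tf, "l" is 1 + ln(tf), "t" is ln(N / df), and "c" is cosine normalization. The function names and the toy collection are hypothetical and are not taken from the system described here; stemming, UMLS expansion, GPE overweighting and the divergence from randomness component are all left out.

```python
import math
from collections import Counter


def idf(term, doc_freq, n_docs):
    """SMART 't' component: ln(N / df); 0.0 for terms absent from the collection."""
    df = doc_freq.get(term, 0)
    return math.log(n_docs / df) if df else 0.0


def atc_weights(tokens, doc_freq, n_docs):
    """Document weights: augmented tf ('a') * idf ('t'), cosine-normalized ('c')."""
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    raw = {t: (0.5 + 0.5 * f / max_tf) * idf(t, doc_freq, n_docs)
           for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def ltn_weights(tokens, doc_freq, n_docs):
    """Query weights: log tf ('l') * idf ('t'), no normalization ('n')."""
    tf = Counter(tokens)
    return {t: (1.0 + math.log(f)) * idf(t, doc_freq, n_docs)
            for t, f in tf.items()}


def rank(query_tokens, docs):
    """Score every document by the inner product of query and document weights."""
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    q = ltn_weights(query_tokens, doc_freq, n_docs)
    scores = {}
    for doc_id, tokens in docs.items():
        d = atc_weights(tokens, doc_freq, n_docs)
        scores[doc_id] = sum(w * d.get(t, 0.0) for t, w in q.items())
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy collection of already tokenized MEDLINE-like records.
toy_docs = {
    "d1": ["brca1", "mutation", "breast", "cancer"],
    "d2": ["p53", "mutation", "tumor"],
    "d3": ["breast", "cancer", "screening"],
}

if __name__ == "__main__":
    for doc_id, score in rank(["brca1", "mutation"], toy_docs):
        print(doc_id, round(score, 4))
```

In this sketch a document that shares no term with the query receives a score of zero; a high-precision strategy of the kind mentioned above could additionally discard documents whose score falls below a threshold, although the cut-off actually used is not specified here.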

[1]  Alan F. Smeaton,et al.  On the Use of MeSH Headings to Improve Retrieval Effectiveness , 2003, TREC.

[2]  Jimmy J. Lin,et al.  Fusion of Knowledge-Intensive and Statistical Approaches for Retrieving and Annotating Textual Genomics Documents , 2005, TREC.

[3]  Marc Moens,et al.  Argumentative Classification of Extracted Sentences as a First Step Towards Flexible Abstracting , 1999 .

[4]  Olivier Bodenreider,et al.  The NLM Indexing Initiative , 2000, AMIA.

[5]  D. Rebholz-Schuhmann,et al.  Computer-assisted generation of a protein-interaction database for nuclear receptors. , 2003, Molecular endocrinology.

[6]  Patrick Ruch,et al.  Evaluation of Stemming, Query Expansion and Manual Indexing Approaches for the Genomic Task , 2005, TREC.

[7]  Jacques Savoy,et al.  Report on the TREC 2003 Experiment: Genomic and Web Searches , 2003, TREC.

[8]  C. J. van Rijsbergen,et al.  Probabilistic models of information retrieval based on measuring the divergence from randomness , 2002, TOIS.

[9]  Joyce A. Mitchell,et al.  Gene Indexing: Characterization and Analysis of NLM's GeneRIFs , 2003, AMIA.

[10]  Martijn J. Schuemie,et al.  Distribution of information in biomedical abstracts and full-text publications , 2004, Bioinform.

[11]  Robert H. Baud,et al.  Learning-Free Text Categorization , 2003, AIME.

[12]  Sumio Fujita.  Revisiting Again Document Length Hypotheses TREC 2004 Genomics Track Experiments at Patolis , 2004, TREC.

[13]  Nigel Collier,et al.  Extracting the Names of Genes and Gene Products with a Hidden Markov Model , 2000, COLING.

[14]  Patrick Ruch,et al.  Automatic assignment of biomedical categories: toward a generic approach , 2006, Bioinform.

[15]  John M. Swales,et al.  Genre Analysis: English in Academic and Research Settings , 1993 .

[16]  Robert H. Baud,et al.  Minimal Commitment and Full Lexical Disambiguation: Balancing Rules and Hidden Markov Models , 2000, CoNLL/LLL.

[17]  H. P. Edmundson,et al.  New Methods in Automatic Extracting , 1969, JACM.

[18]  Patrick Ruch,et al.  Using Argumentation to Retrieve Articles with Similar Citations from MEDLINE , 2004, NLPBA/BioNLP.

[19]  Miguel E. Ruiz.  Experiments on Genomics Ad Hoc Retrieval , 2005, TREC.

[20]  Patrick Ruch,et al.  Finding Relevant Passages in Scientific Articles: Fusion of Automatic Approaches vs. an Interactive Team Effort , 2006, TREC.

[21]  Patrick Ruch,et al.  Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot , 2005, BMC Bioinformatics.

[22]  Claire Nedellec,et al.  Sentence Filtering for Information Extraction in Genomics, a Classification Problem , 2001, PKDD.

[23]  Miguel A. Andrade-Navarro,et al.  Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions , 1999, ISMB.

[24]  Christian Lovis,et al.  Building Medical Dictionaries for Patient Encoding Systems: A Methodology , 1997, AIME.

[25]  Ellen M. Voorhees,et al.  Query expansion using lexical-semantic relations , 1994, SIGIR '94.

[26]  Edward A. Fox,et al.  Combination of Multiple Searches , 1993, TREC.

[27]  Anne-Lise Veuthey,et al.  A Probabilistic Information Retrieval Approach to Medical Annotation in SWISS-PROT , 2003, MIE.

[28]  Marc Moens,et al.  Sentence extraction and rhetorical classification for flexible abstracts , 1998 .

[29]  Francine Chen,et al.  A trainable document summarizer , 1995, SIGIR '95.

[30]  Mehmet Kayaalp,et al.  Methods for Accurate Retrieval of MEDLINE Citations in Functional Genomics , 2003, TREC.

[31]  William R. Hersh,et al.  TREC GENOMICS Track Overview , 2003, TREC.

[32]  Patrick Ruch,et al.  Report on the TREC 2003 Experiment: Genomic Track , 2003, TREC.

[33]  Nigel Collier,et al.  Zone Identification in Biology Articles as a Basis for Information Extraction , 2004, NLPBA/BioNLP.

[34]  Padmini Srinivasan,et al.  Optimal Document-Indexing Vocabulary for MEDLINE , 1996, Inf. Process. Manag.