Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot

Background: In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair containing a protein and a relevant article, automatic assignment of a set of categories.

Methods: The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both on the basis of annotator judgements, as established by the competition, and on the basis of mean average precision measures computed using a curated sample of Swiss-Prot.

Results: Our system achieved the best combination of recall and precision for both passage retrieval and text categorization, as evaluated by the official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments. The top proposed term is relevant in less than 20% of cases, whereas with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibit effective confidence estimation capabilities.

Conclusion: From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists.
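
The two measurements described in the Methods can be illustrated with a minimal sketch. The abstract does not say which distance the classifier uses, so the cosine distance over raw token counts below is an assumption chosen only for illustration, and all function names (`rank_sentences`, `average_precision`, `mean_average_precision`) are hypothetical rather than taken from the system itself.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two token-count vectors.

    Assumption: cosine distance stands in for the unspecified distance
    used by the original classifier.
    """
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def rank_sentences(sentences, go_term):
    """Rank sentences by increasing distance to the GO term label."""
    term_vec = Counter(tokenize(go_term))
    scored = [(cosine_distance(Counter(tokenize(s)), term_vec), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0])


def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    article = [
        "The protein binds ATP in vitro.",
        "We thank the reviewers.",
        "Kinase activity was measured.",
    ]
    for distance, sentence in rank_sentences(article, "ATP binding"):
        print(f"{distance:.3f}  {sentence}")
```

With exact token matching, the example only rewards the shared token "ATP"; "binds" and "binding" do not match, which hints at why purely lexical distances need additional normalization when GO labels and article wording differ.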
