BiTeM/SIBtex group proceedings for BioCreative IV, Track 4

For BioCreative IV Track 4, we exploited the power of our machine learning Gene Ontology classifier, GOCat. GOCat computes similarities between an input text and already curated instances in order to infer GO terms. GO Annotations (GOA) and MEDLINE are used to populate the knowledge base (almost 100,000 curated abstracts). For subtask A, we designed a state-of-the-art statistical approach, using a naïve Bayes classifier and the official training set. We also investigated exploiting GeneRIFs as an alternative training set forty times larger, but the results were disappointing, probably because of the lack of correct negative instances. For subtask B, we applied GOCat to the output of the first subtask and obtained promising results, up to 0.65 for Recall at 20 with hierarchical metrics. Thanks to BioCreative IV, we were able to design a complete workflow for curation. Given a gene name and a full text, this system is able to deliver highly relevant GO terms along with a set of evidence sentences; the observed performance is sufficient for use in a real semi-automatic curation workflow.

Introduction

The problem of data deluge in proteomics is well known: the available curated data lag behind current biological knowledge contained in the literature (1–3), and professional curators need assistance from text mining in order to keep up with the literature (4–6). One particularly time-consuming and labor-intensive task is gene function curation of a full text with Gene Ontology (GO) terms. Such curation from the literature is a highly complex task, because it requires expertise not only in genomics but also in the ontology itself. This task has been studied since the first BioCreative challenge in 2005 (7) and is still considered both unsolved and long-awaited by the community (8). Our group participated in the first BioCreative. At that time, we extracted GO terms from full texts with EAGL, a locally developed dictionary-based classifier (9).
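To make the similarity-based inference mentioned in the abstract more concrete, the following is a minimal sketch of how GO terms might be inferred from the curated instances most similar to an input text. It is not GOCat's actual implementation, which is not detailed here: the TF-IDF weighting, the cosine similarity, the nearest-neighbour voting, the function names, and the three-entry toy knowledge base are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def infer_go_terms(query, knowledge_base, k=2, top_n=3):
    """Rank GO terms by summed similarity over the k most similar entries."""
    texts = [text for text, _ in knowledge_base]
    vectors, idf = tfidf_vectors(texts)
    # Query terms unseen in the knowledge base receive a zero weight.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    neighbours = sorted(
        ((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True
    )[:k]
    scores = defaultdict(float)
    for sim, i in neighbours:
        for go_term in knowledge_base[i][1]:
            scores[go_term] += sim
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical toy knowledge base: (abstract text, curated GO terms).
TOY_KB = [
    ("the protein induces apoptosis and caspase activation in tumor cells",
     ["GO:0006915"]),
    ("the transcription factor binds dna at the promoter and regulates transcription",
     ["GO:0003677", "GO:0006355"]),
    ("caspase cleavage triggers apoptosis during development",
     ["GO:0006915"]),
]

ranked = infer_go_terms("caspase dependent apoptosis in neurons", TOY_KB, k=2)
print(ranked)
```

In this toy run, only the two apoptosis-related entries share vocabulary with the query, so their common GO term is the only one proposed. A real knowledge base of almost 100,000 curated abstracts would need an indexed search engine rather than this exhaustive comparison.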
Dictionary-Based