Cell line name recognition in support of the identification of synthetic lethality in cancer from text

Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating, for instance, the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems perform sufficiently well in realistic, broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: Gellus, a broad-coverage corpus, and CLL, a focused target-domain corpus.

Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.

Availability and implementation: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/.

Contact: sukaew@utu.fi
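
For readers unfamiliar with the F-score reported above, the following minimal sketch shows how a mention-level F-score can be computed from gold and predicted spans. It assumes an exact-match criterion on (document, start, end) spans, which the abstract does not specify; the function name `precision_recall_f1` and the toy spans are hypothetical and are not taken from the authors' evaluation code.

```python
from typing import Set, Tuple

# A mention is identified by (document id, start offset, end offset).
# Exact span matching is an assumption made for this illustration.
Span = Tuple[str, int, int]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match mention spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): four gold mentions, three predictions,
# one of which has a wrong end offset and so does not count as a match.
gold = {("doc1", 10, 14), ("doc1", 40, 46), ("doc2", 5, 11), ("doc2", 30, 35)}
predicted = {("doc1", 10, 14), ("doc1", 40, 46), ("doc2", 5, 12)}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} F1={f:.4f}")
```

Under a more lenient criterion, such as overlap matching, the boundary error in the toy data would count as a hit and the scores would be higher, so the matching criterion matters when comparing reported figures.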
