Automated recognition of malignancy mentions in biomedical literature

Background: The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, the promise of these methods has been tempered by an inability to scale approaches efficiently in ways that minimize manual effort and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept, to determine whether the underlying probabilistic model generalizes effectively to unrelated concepts with minimal manual intervention for model retraining.

Results: We developed MTag, a named entity recognizer for clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts identified 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, adding an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact on performance.

Conclusion: Together, these results suggest that the identification of disparate biomedical entity classes in free text may be achievable with high accuracy and only moderate additional effort for each new application domain.
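As a reading aid for the evaluation figures above, the following is a minimal sketch of how precision, recall, and F-measure are commonly computed for entity mentions. It assumes the balanced F-measure (the harmonic mean of precision and recall) and exact matching of mention spans; the abstract does not state either choice, and the function names and the toy spans are hypothetical, not taken from MTag.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_mentions(predicted: set, gold: set) -> tuple:
    """Exact-match scores over (doc_id, start, end) mention spans.

    A predicted span counts as correct only if the identical span
    appears in the gold annotations (an assumption for this sketch).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Consistency check on the reported evaluation-set figures.
print(round(f_measure(0.85, 0.83), 2))

# Hypothetical toy data: three gold mentions, three predictions, two shared.
gold = {("d1", 0, 10), ("d1", 20, 30), ("d2", 5, 15)}
predicted = {("d1", 0, 10), ("d2", 5, 15), ("d2", 40, 50)}
print(score_mentions(predicted, gold))
```

Under these assumptions, the reported 0.85 precision and 0.83 recall combine to an F-measure that rounds to the reported 0.84.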
