[Development of a system for extracting the information of candidate tumor markers reported in biomedical literatures].

BACKGROUND: Since the Human Genome Project was completed in 2003, numerous reports on cancer and related markers have been published. This study aimed to develop a system that automatically extracts information on the relationship between cancers and tumor markers from the biomedical literature.

METHODS: Named entities of tumor markers were recognized with both a dictionary-based method and a support vector machine (SVM), a machine learning technique. Named entities of cancers were recognized with the MeSH dictionary.

RESULTS: Relational and filtering keywords were selected after annotating 160 abstracts from PubMed. Relational information was extracted only when one of the relational keywords occupied an appropriate position in the parse tree of a sentence containing both a tumor marker entity and a disease entity. System performance was evaluated on a separate set of 77 abstracts. With the relational and filtering keywords, precision was 94.38% and recall was 66.14%; without this expert knowledge, precision was 49.16% and recall was 69.29%.

CONCLUSIONS: We developed a system that extracts relational information between a tumor and its markers by incorporating expert knowledge. A system that exploits expert knowledge in this way could serve as a reference for developing information extraction systems in other medical fields.
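The keyword-gated extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicons and keyword lists below are hypothetical toy values, entity recognition is reduced to dictionary lookup (no SVM), and a simple check for keywords anywhere in the sentence stands in for the parse-tree position test. The `precision_recall` helper shows how the two reported metrics are computed from predicted and gold relation sets.

```python
import re

# Hypothetical toy lexicons and keyword lists; the study's actual
# dictionaries and expert-selected keywords are not given in the abstract.
MARKERS = {"psa", "cea", "ca-125"}
CANCERS = {"prostate cancer", "ovarian cancer", "colorectal cancer"}
RELATION_KEYWORDS = {"marker", "biomarker", "associated", "elevated"}
FILTER_KEYWORDS = {"not", "no"}


def find_terms(sentence, lexicon):
    """Return lexicon terms found in the sentence as whole words (case-insensitive)."""
    lowered = sentence.lower()
    return sorted(
        term for term in lexicon
        if re.search(r"(?<![\w-])" + re.escape(term) + r"(?![\w-])", lowered)
    )


def extract_relations(sentence, use_keywords=True):
    """Return (marker, cancer) pairs for one sentence.

    Simplifying assumption: keywords are checked anywhere in the sentence,
    not at a specific position in a parse tree.
    """
    markers = find_terms(sentence, MARKERS)
    cancers = find_terms(sentence, CANCERS)
    if not markers or not cancers:
        return []
    if use_keywords:
        tokens = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
        has_relation = bool(tokens & RELATION_KEYWORDS)
        is_filtered = bool(tokens & FILTER_KEYWORDS)
        if not has_relation or is_filtered:
            return []
    return [(m, c) for m in markers for c in cancers]


def precision_recall(predicted, gold):
    """Precision = TP / predicted; recall = TP / gold (0.0 when undefined)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

In this sketch, calling `extract_relations` with `use_keywords=False` falls back to plain co-occurrence of a marker and a cancer in one sentence, which mirrors the "without expert knowledge" condition in the reported comparison: it accepts more sentences, so recall can rise slightly while precision drops.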
