A Study on Collecting and Structuring Language Resource for Named Entity Recognition and Relation Extraction from Biomedical Abstracts

This paper introduces an integrated model for systematically constructing a linguistic resource database for use by machine learning-based biomedical information extraction systems. The proposed method defines an orderly process for collecting and constructing dictionaries and training sets for both named-entity recognition and relation extraction. The heterogeneous structures of the resources, which are collected from diverse sources, are analyzed to derive the essential items and fields of the integrated database. All collected resources are then converted and refined to build an integrated linguistic resource store. We constructed entity dictionaries for genes, proteins, diseases, and drugs, which are considered core named entities in the biomedical domain, and conducted verification tests to measure their acceptability.

Keywords: Information Extraction, Named-Entity Recognition, Relation Extraction, Bio-text Mining, Training Set
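As a rough illustration of the conversion and refinement steps described in the abstract, the sketch below maps two differently structured source records onto one shared dictionary schema and then merges duplicates. It is not the authors' implementation: the `DictionaryEntry` fields, the two input layouts, the source names, and the identifiers are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """Unified dictionary record; every field name here is a hypothetical choice."""
    term: str
    entity_type: str  # e.g. "gene", "protein", "disease", or "drug"
    source: str
    source_id: str
    synonyms: set = field(default_factory=set)


def from_tsv_line(line, entity_type, source):
    """Convert an assumed 'id<TAB>name<TAB>syn1|syn2' line into the shared schema."""
    source_id, name, raw_synonyms = line.rstrip("\n").split("\t")
    synonyms = {s.strip() for s in raw_synonyms.split("|") if s.strip()}
    return DictionaryEntry(name.strip(), entity_type, source, source_id, synonyms)


def from_record(record, entity_type, source):
    """Convert an assumed dict with 'accession', 'label', 'aliases' keys."""
    return DictionaryEntry(
        record["label"].strip(),
        entity_type,
        source,
        record["accession"],
        set(record.get("aliases", [])),
    )


def merge_entries(entries):
    """Refinement step: merge entries with the same case-insensitive term and type."""
    merged = {}
    for entry in entries:
        key = (entry.term.lower(), entry.entity_type)
        if key not in merged:
            # Copy so that the caller's entries are not mutated by later merges.
            merged[key] = DictionaryEntry(
                entry.term, entry.entity_type, entry.source,
                entry.source_id, set(entry.synonyms),
            )
        else:
            kept = merged[key]
            kept.synonyms |= entry.synonyms
            if entry.term != kept.term:
                kept.synonyms.add(entry.term)  # keep the variant spelling
    return list(merged.values())


if __name__ == "__main__":
    # Identifiers and source names below are placeholders, not real accessions.
    collected = [
        from_tsv_line("X1\tTP53\tp53|", "gene", "source_a"),
        from_record({"accession": "Y9", "label": "tp53", "aliases": ["P53"]},
                    "gene", "source_b"),
        from_record({"accession": "Z3", "label": "aspirin",
                     "aliases": ["acetylsalicylic acid"]}, "drug", "source_b"),
    ]
    for entry in merge_entries(collected):
        print(entry.entity_type, entry.term, sorted(entry.synonyms))
```

Keying the merge on the lower-cased term plus entity type is only one possible refinement rule; a real resource would likely also need identifier cross-references and handling of ambiguous names shared across entity types.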
