Large-scale information extraction plays an important role in advanced information retrieval as well as in question answering and summarization. Information extraction can be defined as the process of converting unstructured documents into formalized, tabular information; it consists of named-entity recognition, terminology extraction, coreference resolution, and relation extraction. Because these elementary technologies have so far been studied independently, integrating all the processes needed for information extraction is not trivial, owing to the diversity of their input/output formats and operating environments. As a result, it is difficult to extract both named entities and technical terms from scientific documents at once. In this study, we define scientific core entities as a set of 10 types of named entities and technical terminologies in the biomedical domain. To extract these entities automatically from scientific documents at once, we develop a framework for scientific core entity extraction that embraces all the pivotal language processors: a named-entity recognizer, a co-reference resolver, and a terminology extractor. Each module of the integrated system has been evaluated with various corpora as well as KEEC 2009. The system will be utilized in various information service areas such as information retrieval, question answering (Q&A), document indexing, and dictionary construction.