Paper information - Putting Encyclopaedia Knowledge into Structural Form: Finite State Transducers Approach

In biology, and in functional genomics in particular, understanding the dependence and interplay between the genomic and ecological characteristics of organisms is a challenging problem. Some public databases combine this kind of information, but much more information about microbes and other organisms resides in unstructured and semi-structured documents, such as encyclopaedias. In this paper we present a method, based on finite state transducers, for extracting information from semi-structured resources such as encyclopaedias. The method consists of two clearly distinguished phases. The first phase relies strongly on analysis of the document structure and is used to locate records of data in the text. The second phase uses finite state transducers built for data extraction, which can be modified to achieve the preferred efficiency, to extract particular characteristics from the text. We show how the two-phase method is applied to the text of the encyclopaedia "Systematic Bacteriology". As a result, a fully structured database of genotype and phenotype characteristics of organisms has been created from the encyclopaedia's unstructured descriptions.
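The two phases described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the authors' transducers or the real layout of the encyclopaedia. The record header format ("1. Genus species"), the sample text, the field names `gram_stain` and `optimum_temperature_c`, and the transition table are all hypothetical and chosen only to show the idea: structure-based record location followed by a small token-level finite state transducer.

```python
import re

# Hypothetical sample text; the real encyclopaedia layout may differ.
SAMPLE = """
1. Escherichia coli
Gram-negative rods. Optimum temperature: 37 C.
2. Bacillus subtilis
Gram-positive rods. Optimum temperature is 30 C.
"""

# Phase 1: locate records using document structure.
# Assumption: every record starts with a line such as "1. Genus species".
RECORD_HEADER = re.compile(r"^\d+\.\s+([A-Z][a-z]+ [a-z]+)\s*$")

def locate_records(text):
    """Return a list of (organism_name, description) pairs."""
    records, name, body = [], None, []
    for line in text.splitlines():
        match = RECORD_HEADER.match(line)
        if match:
            if name is not None:
                records.append((name, " ".join(body)))
            name, body = match.group(1), []
        elif name is not None and line.strip():
            body.append(line.strip())
    if name is not None:
        records.append((name, " ".join(body)))
    return records

# Phase 2: a tiny finite state transducer over tokens.
# Each entry maps a state to (token pattern, next state, output field).
# An output field of None means the token is consumed without output.
TRANSITIONS = {
    "start": [
        (re.compile(r"gram", re.I), "gram", None),
        (re.compile(r"optimum", re.I), "optimum", None),
    ],
    "gram": [
        (re.compile(r"positive|negative", re.I), "start", "gram_stain"),
    ],
    "optimum": [
        (re.compile(r"temperature", re.I), "temperature", None),
    ],
    "temperature": [
        (re.compile(r"\d+"), "start", "optimum_temperature_c"),
        (re.compile(r"is|of|:", re.I), "temperature", None),
    ],
}

def step(state, token):
    """Return (next_state, field) for the first matching transition, else None."""
    for pattern, next_state, field in TRANSITIONS[state]:
        if pattern.fullmatch(token):
            return next_state, field
    return None

def transduce(description):
    """Run the transducer over a description and collect emitted fields."""
    tokens = re.findall(r"[A-Za-z]+|\d+|:", description)
    state, output = "start", {}
    for token in tokens:
        result = step(state, token)
        if result is None and state != "start":
            result = step("start", token)  # reset, then retry this token
        if result is None:
            state = "start"
            continue
        state, field = result
        if field is not None:
            output[field] = token.lower()
    return output

def build_database(text):
    """Combine both phases into a list of structured rows."""
    return [{"organism": name, **transduce(desc)}
            for name, desc in locate_records(text)]
```

Calling `build_database(SAMPLE)` yields one dictionary per located record. Extending coverage in this sketch means adding states and transitions to `TRANSITIONS`, which loosely mirrors the abstract's remark that the transducers can be modified to reach the preferred efficiency.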

Vesna Pajic
