Putting Encyclopaedia Knowledge into Structural Form: Finite State Transducers Approach

In biology, and in functional genomics in particular, understanding the dependence and interplay between the genomic and ecological characteristics of organisms is a very challenging problem. Some public databases combine this kind of information, but much more information about microbes and other organisms still resides in unstructured and semi-structured documents, such as encyclopaedias. In this paper we present a method, based on finite state transducers, for extracting information from semi-structured resources such as encyclopaedias. The method consists of two clearly distinguished phases. The first phase relies strongly on analysis of the document structure and is used to locate records of data in the text. The second phase uses finite state transducers built for data extraction to extract particular characteristics from the text; these transducers can be modified to achieve the preferred efficiency. We show how the two-phase method is applied to the text of the encyclopaedia "Systematic Bacteriology". A fully structured database of genotype and phenotype characteristics of organisms has been created from the encyclopaedia's unstructured descriptions.
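
The two phases can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the record heading pattern ("Genus <Name>" on its own line), the single "gram_stain" characteristic, and the names `locate_records`, `transduce` and `extract_database` are all assumptions made for illustration only.

```python
import re

# Phase 1: locate records using document structure.
# ASSUMPTION: each record begins with a line "Genus <Name>"; the real
# encyclopaedia layout is not described in the abstract.
RECORD_HEADING = re.compile(r"^Genus[ \t]+(\w+)[ \t]*$", re.MULTILINE)


def locate_records(text):
    """Split the text into (name, body) pairs, one per heading."""
    headings = list(RECORD_HEADING.finditer(text))
    records = []
    for i, match in enumerate(headings):
        # A record body runs until the next heading or the end of the text.
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        records.append((match.group(1), text[match.end():end].strip()))
    return records


# Phase 2: a tiny finite state transducer over word tokens.
# Each entry maps (state, input token) to (next state, emitted output).
# ASSUMPTION: these transitions are hypothetical and cover one trait only.
TRANSITIONS = {
    ("start", "gram"): ("gram", None),
    ("gram", "negative"): ("start", ("gram_stain", "negative")),
    ("gram", "positive"): ("start", ("gram_stain", "positive")),
}


def transduce(body):
    """Run the transducer over a record body and collect emitted fields."""
    state = "start"
    output = {}
    for token in re.findall(r"[a-z]+", body.lower()):
        if (state, token) not in TRANSITIONS:
            # No transition from this state: reset and retry from the start state.
            state = "start"
        state, emitted = TRANSITIONS.get((state, token), ("start", None))
        if emitted is not None:
            key, value = emitted
            output.setdefault(key, value)  # keep the first value found
    return output


def extract_database(text):
    """Apply both phases and return {record name: extracted fields}."""
    return {name: transduce(body) for name, body in locate_records(text)}


if __name__ == "__main__":
    sample = (
        "Genus Alpha\n"
        "Cells are rods. Gram-negative. Aerobic.\n"
        "Genus Beta\n"
        "Cocci, Gram positive.\n"
    )
    print(extract_database(sample))
```

Keeping the transition table separate from the driver loop mirrors the idea that the transducers can be modified independently of the record-locating phase; adding a characteristic in this sketch means adding transitions, not changing the control flow.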
