Towards a database for genotype-phenotype association research: mining data from encyclopaedia

Associating the phenotypic characteristics of an organism with the molecules encoded by its genome requires well-structured genotype and phenotype data. We use a novel method to extract data on the phenotypic and genotypic characteristics of microorganisms from text. Our source is an encyclopedia of microorganisms that holds phenotypic and genotypic data; from it we create a structured, flexible data resource that contains genotype and phenotype data for 2412 species and 873 genera of microbes and can be exported to a range of database formats. This resource has great potential for future biological research on genotype-phenotype associations. In this paper, we describe the structure and content of the resulting database and evaluate the method used to extract the data. We conclude that the database can serve as a reliable complementary resource for research into genotype-phenotype association.
