Automatic extraction of microorganisms and their habitats from free text using text mining workflows

In this paper we illustrate the use of text mining workflows to automatically extract instances of microorganisms and their habitats from free text; these entries can then be curated and added to different databases. To this end, the workflows include a Conditional Random Field (CRF) based classifier that extracts mentions of microorganisms, mentions of habitats, and the relations between organisms and their habitats. The results show good performance for microorganism recognition and for relation extraction (precision above 80%), while habitat recognition is only moderate (precision of about 65%). We also conjecture that PDF-to-text conversion can be quite noisy, and that this noise indirectly affects any sentence-based relation extraction algorithm.
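To make the last point concrete, the following is a minimal sketch of sentence-based relation candidate generation. It is not the workflow described in the paper: the CRF tagger is replaced by a hypothetical dictionary lookup, and the lexicons, function names, and example sentences are all assumptions made for illustration. It only shows how a single spurious sentence break, of the kind PDF-to-text conversion may introduce, can separate an organism from its habitat and lose the relation.

```python
import re
from itertools import product

# Hypothetical toy lexicons standing in for the CRF-based entity tagger.
ORGANISMS = {"Helicobacter pylori", "Zunongwangia profunda"}
HABITATS = {"human stomach", "deep-sea sediment"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(sentence, lexicon):
    """Return the lexicon entries that occur verbatim in the sentence."""
    return sorted(term for term in lexicon if term in sentence)


def candidate_pairs(text):
    """Pair every organism with every habitat found in the same sentence."""
    pairs = []
    for sentence in split_sentences(text):
        organisms = find_mentions(sentence, ORGANISMS)
        habitats = find_mentions(sentence, HABITATS)
        pairs.extend(product(organisms, habitats))
    return pairs


# Made-up example text; the second version contains a stray full stop,
# imitating a conversion artifact that splits the sentence in two.
clean = "Helicobacter pylori colonizes the human stomach. It was sequenced in 1997."
noisy = "Helicobacter pylori colonizes the. human stomach. It was sequenced in 1997."

print(candidate_pairs(clean))
print(candidate_pairs(noisy))
```

On the clean text the organism and habitat share a sentence and form one candidate pair; on the noisy text they fall into different sentences, so no candidate is produced regardless of how well the entities themselves are recognized.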
