Building the Scientific Knowledge Mine (SciKnowMine): a community-driven framework for text mining t

Although there exist many high-performing text-mining tools to address literature biocuration (populating biomedical databases from the published literature), the challenge of delivering effective computational support for curation of large-scale biomedical databases is still unsolved. We describe a community-driven solution (the SciKnowMine Project) implemented using the Unstructured Information Management Architecture (UIMA) framework. This system's design is intended to provide knowledge engineering enhancement of pre-existing biocuration systems by providing a large-scale text-processing pipeline bringing together multiple Natural Language Processing (NLP) toolsets for use within well-defined biocuration tasks. By working closely with biocurators at the Mouse Genome Informatics (MGI) group at The Jackson Laboratory in the context of their everyday work, we break down the biocuration workflow into components and isolate specific targeted elements to provide maximum impact. We envisage a system for classifying documents based on a series of increasingly specific classifiers, starting with very simple surface-level decision criteria and gradually introducing more sophisticated techniques. This classification pipeline will be applied to the task of identifying papers of interest to mouse genetics (primary MGI document triage), thus facilitating the input of documents into the MGI curation pipeline. We also describe other biocuration challenges (gene normalization) and how our NLP-framework based approach could be applied to them. 1 The SciKnowMine project is funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: RO1-GM083871, NLM: 2R01LM009254, NLM:2R01LM008111, NLM:1R01LM010120-01, NHGRI:5P41HG000330 2 http://www.informatics.jax.org

[1]  Ellen Riloff,et al.  Automatically Generating Extraction Patterns from Untagged Text , 1996, AAAI/IAAI, Vol. 2.

[2]  Guus Schreiber,et al.  Knowledge Engineering and Management: The CommonKADS Methodology , 1999 .

[3]  Mark Craven,et al.  Constructing Biological Knowledge Bases by Extracting Information from Text Sources , 1999, ISMB.

[4]  Miguel A. Andrade-Navarro,et al.  Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions , 1999, ISMB.

[5]  J. Guénet the mouse genome , 2005, Nature.

[6]  Marti A. Hearst,et al.  TREC 2004 Genomics Track Overview , 2005, TREC.

[7]  Hans-Michael Müller,et al.  Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature , 2004, PLoS biology.

[8]  David A. Ferrucci,et al.  Building an example application with the Unstructured Information Management Architecture , 2004, IBM Syst. J..

[9]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[10]  Michael Schroeder,et al.  GoPubMed: exploring PubMed with the Gene Ontology , 2005, Nucleic Acids Res..

[11]  Daniel Hanisch,et al.  ProMiner: rule-based protein and gene entity recognition , 2005, BMC Bioinformatics.

[12]  J. Guénet The Mouse Genome , 2005 .

[13]  D. Rebholz-Schuhmann,et al.  Facts from Text—Is Text Mining Ready to Deliver? , 2005, PLoS biology.

[14]  Philip E. Bourne,et al.  Biocurators: Contributors to the World of Science , 2006, PLoS Comput. Biol..

[15]  Judith A. Blake,et al.  The Mouse Genome Database (MGD): updates and enhancements , 2005, Nucleic Acids Res..

[16]  Eduard H. Hovy,et al.  Infrastructure for Annotation-Driven Information Extraction from the Primary Scientific Literature: Principles and Practice , 2007, 2007 IEEE Congress on Services (Services 2007).

[17]  Michael Schroeder,et al.  Inter-species normalization of gene mentions with GNAT , 2008, ECCB.

[18]  K. Bretonnel Cohen,et al.  Intrinsic Evaluation of Text Mining Tools May Not Predict Performance on Realistic Tasks , 2007, Pacific Symposium on Biocomputing.

[19]  Beatrice Alex,et al.  Assisted Curation: Does Text Mining Really Help? , 2007, Pacific Symposium on Biocomputing.

[20]  A. Valencia,et al.  Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge , 2008, Genome Biology.

[21]  K Bretonnel Cohen,et al.  Journal of Biomedical Discovery and Collaboration Open Access an Open-source Framework for Large-scale, Flexible Evaluation of Biomedical Text Mining Systems , 2008 .

[22]  Chris Sander,et al.  Introducing meta-services for biomedical information extraction , 2008, Genome Biology.

[23]  Judith A. Blake,et al.  Integrating text mining into the MGI biocuration workflow , 2009, Database J. Biol. Databases Curation.

[24]  K. Bretonnel Cohen,et al.  U-Compare: share and compare text mining tools with UIMA , 2009, Bioinform..

[25]  Mark A. Musen,et al.  The Open Biomedical Annotator , 2009, Summit on translational bioinformatics.

[26]  Sampo Pyysalo,et al.  Overview of BioNLP’09 Shared Task on Event Extraction , 2009, BioNLP@HLT-NAACL.

[27]  Judith A. Blake,et al.  The Mouse Genome Database: enhancements and updates , 2009, Nucleic Acids Res..