EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts

Background
A better understanding of the mechanisms behind an enzyme's functionality and stability, together with knowledge of mutations and their impact, is crucial for researchers working with enzymes. Although several enzyme databases are currently available, the scientific literature remains an important up-to-date source for learning the effects of a mutation on an enzyme. However, going through vast numbers of scientific documents to extract information on a desired mutation is a time-consuming process. In this paper, we therefore describe a unique method, termed EnzyMiner, which automatically identifies PubMed abstracts that contain information on the impact of a protein-level mutation on the stability and/or the activity of a given enzyme.

Results
We present an automated system that identifies abstracts containing an amino-acid-level mutation and then classifies them according to the mutation's effect on the enzyme. For mutation identification, MuGeX, an automated mutation-gene extraction system, has an accuracy of 93.1% with a 91.5 F-measure. For impact analysis, document classification is performed to identify the abstracts that report a change in the enzyme's stability or activity resulting from the mutation. The system was trained on lipases and tested on amylases with an accuracy of 85%.

Conclusion
EnzyMiner identifies the abstracts that contain a protein mutation for a given enzyme and, with the help of information extraction and machine learning techniques, checks whether each abstract is related to a disease. For disease-related abstracts, the mutation list and direct links to the abstracts are retrieved from the system and displayed on the Web. Abstracts that are not disease-related are, in addition to being listed with their mutations, categorized into two groups according to whether the mutation affects the enzyme's stability or its functionality, and these groups are likewise displayed on the Web.
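
The flow described above (find protein-level mutation mentions, separate disease-related abstracts, then group the remainder by stability or activity) can be illustrated with a minimal Python sketch. This is not the MuGeX or EnzyMiner implementation: the regular expression, the cue-word lists, and the function names `extract_mutations` and `triage` are all hypothetical, and simple keyword matching stands in for the machine-learning document classifier that the abstract refers to.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Hypothetical pattern: matches point mutations written as "D176A" or
# "Ser105Ala". Real systems use richer patterns and context filtering.
MUTATION_RE = re.compile(
    rf"\b(?:([{ONE}])(\d+)([{ONE}])|({THREE})(\d+)({THREE}))\b"
)

# Assumed cue words; a stand-in for a trained document classifier.
DISEASE_CUES = ("disease", "patient", "syndrome", "cancer")
STABILITY_CUES = ("stability", "thermostab", "half-life")
ACTIVITY_CUES = ("activity", "catalytic", "specificity")


def extract_mutations(text):
    """Return unique mutations in one-letter form, in order of appearance."""
    found = []
    for m in MUTATION_RE.finditer(text):
        if m.group(1):
            wt, pos, mut = m.group(1), m.group(2), m.group(3)
        else:
            wt = THREE_TO_ONE[m.group(4)]
            pos = m.group(5)
            mut = THREE_TO_ONE[m.group(6)]
        label = f"{wt}{pos}{mut}"
        # Skip "mutations" to the same residue and repeated mentions.
        if wt != mut and label not in found:
            found.append(label)
    return found


def triage(text):
    """Route one abstract: disease-related, or stability/activity groups."""
    mutations = extract_mutations(text)
    if not mutations:
        return {"mutations": [], "groups": []}
    lower = text.lower()
    if any(cue in lower for cue in DISEASE_CUES):
        return {"mutations": mutations, "groups": ["disease"]}
    groups = []
    if any(cue in lower for cue in STABILITY_CUES):
        groups.append("stability")
    if any(cue in lower for cue in ACTIVITY_CUES):
        groups.append("activity")
    return {"mutations": mutations, "groups": groups}


if __name__ == "__main__":
    example = ("The D176A and Ser105Ala mutants of the lipase "
               "showed increased thermostability.")
    print(triage(example))
```

A pattern this simple will also match strings that are not mutations (strain or cell-line names of the same shape, for instance), which is one reason a dedicated extraction system and a trained classifier are used in practice.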
