Extraction of human kinase mutations from literature, databases and genotyping studies

Background: There is considerable interest in characterizing the biological role of specific protein residue substitutions through mutagenesis experiments. Additionally, recent efforts to detect disease-associated SNPs have motivated both the manual annotation and the automatic extraction of naturally occurring sequence variations from the literature, especially for protein families that play a significant role in signaling processes, such as kinases. Systematic integration and comparison of kinase mutation information from multiple sources, covering literature, manual annotation databases and large-scale experiments, can result in a more comprehensive view of the functional, structural and disease-associated aspects of protein sequence variants. Previously published mutation extraction approaches did not sufficiently distinguish between two fundamentally different categories of variation origin, namely naturally occurring mutations and induced mutations generated through in vitro experiments.

Results: We present a literature mining pipeline for the automatic extraction and disambiguation of single-point mutation mentions from both abstracts and full text articles, followed by a sequence validation check that links mutations to their corresponding kinase protein sequences. Each mutation is scored according to whether it corresponds to an induced mutation or a natural sequence variant. We were able to provide direct literature links for a considerable fraction of previously annotated kinase mutations, thus enabling more efficient interpretation of their biological characterization and experimental context. To test the capabilities of the pipeline, the mutations in the protein kinase domain of the kinase family were analyzed. Using our literature extraction system, we recovered a total of 643 mutation-protein associations from PubMed abstracts and 6,970 from a large collection of full text articles. When compared to state-of-the-art annotation databases and high-throughput genotyping studies, the mutation mentions extracted from the literature overlap to a good extent with the existing knowledgebases, whereas the remaining mentions suggest new mutation records that were not previously annotated in the databases.

Conclusion: Using the proposed residue disambiguation and classification approach, we were able to differentiate between natural variant and mutagenesis types of mutations with an accuracy of 93.88%. The resulting system helps human experts construct a gold-standard set of mutations extracted from the literature with minimal manual curation effort, providing direct pointers to relevant evidence sentences. Our system is able to recover mutations from the literature that are not present in state-of-the-art databases. Manual validation by human experts of 100 mutations extracted from PubMed abstracts showed that almost three quarters (72%) were correct, and more than half of these had not been previously annotated in databases.
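
The two core steps named in the Results, extracting single-point mutation mentions and validating them against a protein sequence, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the regular expression, the function names `extract_mutations` and `matches_sequence`, and the toy sequence are all assumptions made for illustration, and the sketch covers only compact mentions such as "T790M" or "Thr790Met".

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Hypothetical pattern (an assumption, not the paper's rules): matches
# compact mentions in one-letter ("T790M") or three-letter ("Thr790Met") form.
MUTATION_RE = re.compile(
    rf"\b(?:([{ONE}])(\d+)([{ONE}])|({THREE})(\d+)({THREE}))\b"
)


def extract_mutations(text):
    """Return (wild_type, position, mutant) tuples for mentions in text."""
    found = []
    for m in MUTATION_RE.finditer(text):
        if m.group(1):  # one-letter form
            wt, pos, mut = m.group(1), m.group(2), m.group(3)
        else:  # three-letter form, normalized to one-letter codes
            wt = THREE_TO_ONE[m.group(4)]
            pos = m.group(5)
            mut = THREE_TO_ONE[m.group(6)]
        if wt != mut:  # skip mentions that do not change the residue
            found.append((wt, int(pos), mut))
    return found


def matches_sequence(mutation, sequence):
    """Sequence validation: is the wild-type residue at the 1-based position?"""
    wt, pos, _ = mutation
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == wt


if __name__ == "__main__":
    toy_sequence = "MKTAYL"  # toy sequence, not a real kinase
    sentence = "We studied K2R and Thr3Ala, but A2R was also reported."
    for mutation in extract_mutations(sentence):
        print(mutation, matches_sequence(mutation, toy_sequence))
```

A check of this kind only confirms that a mention is compatible with a candidate sequence. Classifying a mention as an induced mutation or a natural variant, as the abstract describes, would require a separate scoring step that is not sketched here.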
