KinDER: A Biocuration Tool for Extracting Kinase Knowledge from Biomedical Literature

Kinases are enzymes that mediate phosphate transfer. Extracting information on kinases from the biomedical literature is an important task with direct implications for applications such as drug design. In this work, we develop KinDER (Kinase Document Extractor and Ranker), a biomedical natural language processing tool for extracting functional and disease-related information on kinases. The tool combines information retrieval and machine learning techniques to automatically extract information about protein kinases: it first uses several bio-ontologies to retrieve documents related to kinases, and then uses a supervised classification model to rank them by relevance. KinDER was developed to participate in the Text-mining services for Human Kinome Curation Track of the BioCreative VI challenge. According to the official BioCreative evaluation results, KinDER provides state-of-the-art performance for extracting functional information on kinases from abstracts.

Keywords: kinase; proteins; machine learning; biomedical natural language processing; BioCreative; text classification; supervised learning
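The retrieve-then-rank design described in the abstract can be illustrated with a minimal sketch. This is not the KinDER implementation: the term set `KINASE_TERMS` is a hypothetical stand-in for ontology-derived vocabulary, the multinomial naive Bayes scorer is an assumed placeholder for whatever supervised classifier the tool actually uses, and the training and test documents are invented toy examples.

```python
import math
import re
from collections import Counter

# Hypothetical mini-lexicon standing in for ontology-derived kinase terms.
KINASE_TERMS = {"kinase", "phosphorylation", "phosphorylates", "mapk", "egfr"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(docs, terms=KINASE_TERMS):
    """Stage 1: keep documents mentioning at least one lexicon term."""
    return [d for d in docs if terms & set(tokenize(d))]

def train_naive_bayes(labeled_docs):
    """Collect per-class token counts from (text, label) pairs.

    label 1 = relevant, 0 = not relevant. Both classes must be present.
    """
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter()
    for text, label in labeled_docs:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab

def relevance_score(text, model):
    """Stage 2: log-odds of relevance with add-one smoothing."""
    counts, n_docs, vocab = model
    score = math.log(n_docs[1] / n_docs[0])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (0, 1)}
    for tok in tokenize(text):
        if tok in vocab:  # tokens unseen in training are ignored
            p1 = (counts[1][tok] + 1) / totals[1]
            p0 = (counts[0][tok] + 1) / totals[0]
            score += math.log(p1 / p0)
    return score

def retrieve_and_rank(docs, model):
    """Filter by lexicon, then sort candidates by descending relevance."""
    candidates = retrieve(docs)
    return sorted(candidates, key=lambda d: relevance_score(d, model),
                  reverse=True)

# Toy data, invented for illustration only.
train = [
    ("EGFR kinase phosphorylates substrate in tumor cells", 1),
    ("MAPK kinase mutation linked to disease progression", 1),
    ("kinase mentioned in passing in a review of diet", 0),
    ("weather patterns and crop yield", 0),
]
docs = [
    "Dietary review mentions a kinase in passing",
    "EGFR kinase mutation drives tumor progression",
    "Crop yield under changing weather",
]
model = train_naive_bayes(train)
ranked = retrieve_and_rank(docs, model)
for doc in ranked:
    print(f"{relevance_score(doc, model):+.2f}  {doc}")
```

In this sketch the lexicon filter discards the document with no kinase vocabulary, and the classifier then orders the remaining candidates so that the one resembling the relevant training examples comes first. Separating the two stages keeps the cheap term lookup from being run through the model on every document.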
