Automatic semantic classification of scientific literature according to the hallmarks of cancer

MOTIVATION: The hallmarks of cancer have become highly influential in cancer research. They reduce the complexity of cancer into 10 principles (e.g. resisting cell death and sustaining proliferative signaling) that explain the biological capabilities acquired during the development of human tumors. Since new research depends crucially on existing knowledge, technology for semantic classification of scientific literature according to the hallmarks of cancer could greatly support literature review, knowledge discovery and applications in cancer research.

RESULTS: We present the first step toward the development of such technology. We introduce a corpus of 1499 PubMed abstracts annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer. We use this corpus to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining. We report good performance in both intrinsic and extrinsic evaluations, demonstrating both the accuracy of the methodology and its potential in supporting practical cancer research. We discuss how this approach could be developed and applied further in the future.

AVAILABILITY AND IMPLEMENTATION: The corpus of hallmark-annotated PubMed abstracts and the software for classification are available at: http://www.cl.cam.ac.uk/~sb895/HoC.html.

CONTACT: simon.baker@cl.cam.ac.uk
