Cancer Hallmark Text Classification Using Convolutional Neural Networks

Methods based on deep learning approaches have recently achieved state-of-the-art performance in a range of machine learning tasks and are increasingly applied to natural language processing (NLP). Despite strong results in various established NLP tasks involving general domain texts, there is only limited work applying these models to biomedical NLP. In this paper, we consider a Convolutional Neural Network (CNN) approach to biomedical text classification. Evaluation using a recently introduced cancer domain dataset involving the categorization of documents according to the well-established hallmarks of cancer shows that a basic CNN model can achieve a level of performance competitive with a Support Vector Machine (SVM) trained using complex manually engineered features optimized to the task. We further show that simple modifications to the CNN hyperparameters, initialization, and training process allow the model to notably outperform the SVM, establishing a new state of the art result at this task. We make all of the resources and tools introduced in this study available under open licenses from https://cambridgeltl.github.io/cancer-hallmark-cnn/.

[1]  Anna Korhonen,et al.  Improving Verb Clustering with Automatically Acquired Selectional Preferences , 2009, EMNLP.

[2]  D. Hanahan,et al.  The Hallmarks of Cancer , 2000, Cell.

[3]  Tapio Salakoski,et al.  Distributional Semantics Resources for Biomedical Text Processing , 2013 .

[4]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[5]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[6]  D. Hanahan,et al.  Hallmarks of Cancer: The Next Generation , 2011, Cell.

[7]  Karin M. Verspoor,et al.  BioLemmatizer: a lemmatization tool for morphological processing of biomedical text , 2012, J. Biomed. Semant..

[8]  S. Rashid,et al.  Hallmarks of Cancer Cell , 2017 .

[9]  M. Dolan,et al.  Capturing cancer initiating events in OncoCL, a cancer cell ontology , 2014, AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science.

[10]  Burr Settles,et al.  ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text , 2005 .

[11]  Sampo Pyysalo,et al.  Size (and Domain) Matters: Evaluating Semantic Word Space Representations for Biomedical Text , 2012 .

[12]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[14]  Sampo Pyysalo,et al.  Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013 , 2013, BioNLP@ACL.

[15]  Anna Korhonen,et al.  Automatic semantic classification of scientific literature according to the hallmarks of cancer , 2016, Bioinform..

[16]  Nigel Collier,et al.  Modelling the Combination of Generic and Target Domain Embeddings in a Convolutional Neural Network for Sentence Classification , 2016, BioNLP@ACL.

[17]  Alessandro Moschitti,et al.  UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification , 2015, *SEMEVAL.

[18]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[19]  Sampo Pyysalo,et al.  How to Train good Word Embeddings for Biomedical NLP , 2016, BioNLP@ACL.

[20]  Ye Zhang,et al.  A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification , 2015, IJCNLP.

[21]  Azadeh Nikfarjam,et al.  Mining Twitter for Adverse Drug Reaction Mentions : A Corpus and Classification Benchmark , 2014 .

[22]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[23]  Xiang Zhang,et al.  Character-level Convolutional Networks for Text Classification , 2015, NIPS.

[24]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[25]  Jong C. Park,et al.  OncoSearch: cancer gene search engine with literature evidence , 2014, Nucleic Acids Res..

[26]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[27]  Stephen Clark,et al.  Porting a lexicalized-grammar parser to the biomedical domain , 2009, J. Biomed. Informatics.