Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes

PURPOSE The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance-risk of cancer for germline mutation carriers-or prevalence of germline genetic mutations. MATERIALS AND METHODS We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated data set for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) which learns a linear decision rule on the basis of the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) which learns a complex nonlinear decision rule on the basis of the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS For penetrance classification, we annotated 3,740 paper titles and abstracts and evaluated the two models using 10-fold cross-validation. The SVM model achieved 88.93% accuracy-percentage of papers that were correctly classified-whereas the CNN model achieved 88.53% accuracy. For prevalence classification, we annotated 3,753 paper titles and abstracts. The SVM model achieved 88.92% accuracy and the CNN model achieved 88.52% accuracy. CONCLUSION Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.

[1]  Scott R. Halgrim,et al.  Using natural language processing to improve efficiency of manual chart abstraction in research: the case of breast cancer recurrence. , 2014, American journal of epidemiology.

[2]  Regina Barzilay,et al.  Deriving Machine Attention from Human Rationales , 2018, EMNLP.

[3]  Joe Kesterson,et al.  Comparing methods for identifying pancreatic cancer patients using electronic data sources. , 2010, AMIA ... Annual Symposium proceedings. AMIA Symposium.

[4]  William R. Hersh,et al.  Reducing workload in systematic review preparation using automated citation classification. , 2006, Journal of the American Medical Informatics Association : JAMIA.

[5]  Fernanda Polubriaginof,et al.  The feasibility of using natural language processing to extract clinical information from breast pathology reports , 2012, Journal of pathology informatics.

[6]  Wendy Filsell,et al.  What the papers say: Text mining for genomics and systems biology , 2010, Human Genomics.

[7]  Stan Matwin,et al.  Exploiting the systematic review protocol for classification of medical abstracts , 2011, Artif. Intell. Medicine.

[8]  Jun Zhao,et al.  Recurrent Convolutional Neural Networks for Text Classification , 2015, AAAI.

[9]  Siddhartha Jonnalagadda,et al.  A new iterative method to reduce workload in systematic review process , 2013, Int. J. Comput. Biol. Drug Des..

[10]  Yindalon Aphinyanagphongs,et al.  Research Paper: Text Categorization Models for High-Quality Article Retrieval in Internal Medicine , 2004, J. Am. Medical Informatics Assoc..

[11]  Danielle Braun,et al.  A Clinical Decision Support Tool to Predict Cancer Risk for Commonly Tested Cancer-Related Germline Mutations , 2018, Journal of Genetic Counseling.

[12]  Regina Barzilay,et al.  Validation of a Semiautomated Natural Language Processing-Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance. , 2019, JCO clinical cancer informatics.

[13]  Christine D. Piatko,et al.  Using “Annotator Rationales” to Improve Machine Learning for Text Categorization , 2007, NAACL.

[14]  Andrés Montoyo,et al.  Advances on natural language processing , 2007, Data Knowl. Eng..

[15]  Steven H. Brown,et al.  Automated identification of postoperative complications within an electronic medical record using natural language processing. , 2011, JAMA.

[16]  Stan Matwin,et al.  A new algorithm for reducing the workload of experts in performing systematic reviews , 2010, J. Am. Medical Informatics Assoc..

[17]  Merlijn Sevenster,et al.  A natural language processing pipeline for pairing measurements uniquely across free-text CT reports , 2015, J. Biomed. Informatics.

[18]  A Burgun,et al.  Automated Classification of Free-text Pathology Reports for Registration of Incident Cases of Cancer , 2011, Methods of Information in Medicine.

[19]  Phil Blunsom,et al.  A Convolutional Neural Network for Modelling Sentences , 2014, ACL.

[20]  Alan Ritter,et al.  Using ontology-based semantic similarity to facilitate the article screening process for systematic reviews , 2017, J. Biomed. Informatics.

[21]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[22]  Sophia Ananiadou,et al.  Reducing systematic review workload through certainty-based screening , 2014, J. Biomed. Informatics.

[23]  Halil Kilicoglu,et al.  Combining Relevance Assignment with Quality of the Evidence to Support Guideline Development , 2010, MedInfo.

[24]  Xiang Zhang,et al.  Character-level Convolutional Networks for Text Classification , 2015, NIPS.

[25]  Carla E. Brodley,et al.  Semi-automated screening of biomedical citations for systematic reviews , 2010, BMC Bioinformatics.

[26]  Ye Zhang,et al.  Rationale-Augmented Convolutional Neural Networks for Text Classification , 2016, EMNLP.

[27]  Wen-wai Yim,et al.  Natural Language Processing in Oncology: A Review. , 2016, JAMA oncology.