Identifying the status of genetic lesions in cancer clinical trial documents using machine learning

Background: Many cancer clinical trials now specify the status of a genetic lesion in a patient's tumor as part of the inclusion or exclusion criteria for trial enrollment. To help potential participants and clinicians search for and identify gene-associated clinical trials, it is important to develop automated methods that identify genetic information in narrative trial documents.

Methods: We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The method consists of two steps: 1) distinguishing gene entities from non-gene entities such as English words; and 2) determining whether a genetic lesion status is associated with an identified gene entity and, if so, which one. We developed the method and evaluated its performance using a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. In addition, we applied the classifier to a real-world task of cancer trial annotation and evaluated its performance on a larger sample (4,013 instances of 249 distinct human gene symbols detected in 250 trials).

Results: On the manually annotated data set, the two-stage classifier outperformed the single-stage classifier and achieved its best average accuracy, 83.7%, for the eight most frequently mentioned genes when optimized feature sets were used. The two-stage classifier also generalized better when it was trained on one set of genes and applied to another, independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, its highest accuracy was 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task.

Conclusions: We presented a machine learning-based approach to detect gene entities and their genetic lesion statuses in clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.
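
The two-step structure described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the learner or the features, so a toy bag-of-words naive Bayes stands in for the real classifier, and the class names `NaiveBayes` and `TwoStageClassifier`, the label strings, and the example sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial naive Bayes over whitespace tokens (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter(labels)      # label -> number of examples
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Stage 1: gene vs. non-gene. Stage 2: lesion status for gene mentions only."""

    def __init__(self):
        self.stage1 = NaiveBayes()
        self.stage2 = NaiveBayes()

    def fit(self, contexts, labels):
        # Stage 1 sees every instance, with lesion statuses collapsed to "gene".
        is_gene = ["non-gene" if y == "non-gene" else "gene" for y in labels]
        self.stage1.fit(contexts, is_gene)
        # Stage 2 is trained only on the true gene mentions.
        gene_pairs = [(x, y) for x, y in zip(contexts, labels) if y != "non-gene"]
        self.stage2.fit([x for x, _ in gene_pairs], [y for _, y in gene_pairs])
        return self

    def predict(self, context):
        if self.stage1.predict(context) == "non-gene":
            return "non-gene"
        return self.stage2.predict(context)


# Hypothetical training contexts around the ambiguous symbol "KIT".
train_contexts = [
    "patients with KIT mutation positive tumors",
    "tumors harboring an activating KIT mutation",
    "KIT wild-type disease is required",
    "must have wild-type KIT status confirmed",
    "a first aid kit was provided to staff",
    "the study kit includes a diary",
]
train_labels = [
    "mutation", "mutation", "wild-type", "wild-type", "non-gene", "non-gene",
]

clf = TwoStageClassifier().fit(train_contexts, train_labels)
print(clf.predict("KIT mutation detected in tumors"))
print(clf.predict("the kit includes a diary"))
```

Splitting the decision this way lets the second stage learn lesion-status cues only from true gene mentions, which is one plausible reading of why a two-stage design could generalize across genes better than a single classifier that handles all labels at once.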
