Semi-supervised clinical text classification with Laplacian SVMs: An application to cancer case management

OBJECTIVE To compare linear and Laplacian SVMs on a clinical text classification task, and to evaluate the effect of unlabeled training data on Laplacian SVM performance.

BACKGROUND Developing machine-learning-based clinical text classifiers requires labeled training data, which is obtained through manual review by clinicians. Because labeling is laborious and expensive, training data sets in the clinical domain are of limited size. In contrast, electronic medical record (EMR) systems contain hundreds of thousands of unlabeled notes that supervised machine learning approaches do not use. Semi-supervised learning algorithms train classifiers on both labeled and unlabeled data and can outperform their supervised counterparts.

METHODS We trained support vector machines (SVMs) and Laplacian SVMs on a training reference standard of 820 abdominal CT, MRI, and ultrasound reports labeled for the presence of potentially malignant liver lesions that require follow-up (positive class prevalence 77%). In addition to the training reference standard, the Laplacian SVM used 19,845 randomly sampled unlabeled notes. We evaluated both classifiers on a test set of 520 labeled reports.

RESULTS The Laplacian SVM trained on labeled and unlabeled radiology reports significantly outperformed supervised SVMs (macro-F1 0.773 vs. 0.741, sensitivity 0.943 vs. 0.911, positive predictive value 0.877 vs. 0.883). Performance improved with the number of labeled and unlabeled notes used to train the Laplacian SVM (Pearson's ρ=0.529 for the correlation between the number of unlabeled notes and the macro-F1 score). These results suggest that practical semi-supervised methods such as the Laplacian SVM can leverage the large, unlabeled corpora that reside within EMRs to improve clinical text classification.
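The three reported metrics can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names and the ten toy labels are hypothetical, and macro-F1 is assumed here to be the unweighted mean of the per-class F1 scores, which is the usual definition. The toy data mirror the imbalance described in the abstract (mostly positive reports) and show why macro-F1 can sit well below sensitivity and positive predictive value: errors on the rare negative class weigh as much as errors on the positive class.

```python
def counts(y_true, y_pred, target):
    """Return (tp, fp, fn) when `target` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    return tp, fp, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def class_f1(y_true, y_pred, target):
    """F1 score (harmonic mean of precision and recall) for one class."""
    tp, fp, fn = counts(y_true, y_pred, target)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return safe_div(2 * precision * recall, precision + recall)


def evaluate(y_true, y_pred, positive=1, negative=0):
    """Sensitivity and PPV for the positive class, plus macro-F1 over both classes."""
    tp, fp, fn = counts(y_true, y_pred, positive)
    return {
        "sensitivity": safe_div(tp, tp + fn),  # recall on the positive class
        "ppv": safe_div(tp, tp + fp),          # precision on the positive class
        "macro_f1": (class_f1(y_true, y_pred, positive)
                     + class_f1(y_true, y_pred, negative)) / 2,
    }


# Hypothetical toy labels: 8 positive and 2 negative reports.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
scores = evaluate(y_true, y_pred)
print(scores)
```

In this toy case sensitivity and positive predictive value are both 0.875, while macro-F1 is 0.6875 because the negative class has an F1 of only 0.5.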
