Supervised embedding of textual predictors with applications in clinical diagnostics for pediatric cardiology.

OBJECTIVE Electronic health records contain critical predictive information for machine-learning-based diagnostic aids. However, many traditional machine learning methods fail to simultaneously integrate textual data into the prediction process because of its high dimensionality. In this paper, we present a supervised method using Laplacian Eigenmaps that enables existing machine learning methods to estimate, at the same time, both low-dimensional representations of textual data and accurate predictors based on those representations.
MATERIALS AND METHODS We present a supervised Laplacian Eigenmap method to enhance predictive models by embedding textual predictors into a low-dimensional latent space that preserves the local similarities among the textual data in the original high-dimensional space. The proposed implementation performs alternating optimization using gradient descent. For the evaluation, we applied our method to over 2000 patient records from a large single-center pediatric cardiology practice to predict whether patients were diagnosed with cardiac disease. In our experiments, we consider relatively short textual descriptions because of data availability. We compared our method with latent semantic indexing, latent Dirichlet allocation, and local Fisher discriminant analysis. The results were assessed using four metrics: the area under the receiver operating characteristic curve (AUC), Matthews correlation coefficient (MCC), specificity, and sensitivity.
RESULTS AND DISCUSSION The results indicate that supervised Laplacian Eigenmaps was the highest-performing method in our study, achieving 0.782 and 0.374 for AUC and MCC, respectively. Supervised Laplacian Eigenmaps showed an increase of 8.16% in AUC and 20.6% in MCC over the baseline that excluded textual data, and an increase of 2.69% in AUC and 5.35% in MCC over unsupervised Laplacian Eigenmaps.
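
The four evaluation metrics named above can be computed from binary labels and predicted scores. The following is a minimal sketch using only the Python standard library; the function names, the toy labels and scores, and the 0.5 decision threshold are illustrative assumptions and are not taken from the study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    """True positive rate: share of diseased cases that are flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """True negative rate: share of healthy cases that are cleared."""
    return tn / (tn + fp) if (tn + fp) else 0.0

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; defined as 0 when any marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical, not from the study).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
auc_value = auc(y_true, scores)
mcc_value = matthews_corrcoef(tp, tn, fp, fn)
sens_value = sensitivity(tp, fn)
spec_value = specificity(tn, fp)
print(auc_value, mcc_value, sens_value, spec_value)
```

AUC is threshold-free, whereas MCC, sensitivity, and specificity depend on the chosen decision threshold, so the two kinds of metric can move independently.
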
CONCLUSIONS We present a supervised Laplacian Eigenmap method that embeds textual predictors into a low-dimensional Euclidean space. This method allows many existing machine learning predictors to capture the potential of textual predictors effectively and efficiently, especially predictors based on short texts.
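
To make the embedding step concrete, the sketch below implements the standard unsupervised Laplacian Eigenmaps procedure with NumPy. It is not the supervised variant proposed in the paper, whose alternating gradient-descent objective is not specified in this abstract. The bag-of-words matrix, the cosine-similarity k-nearest-neighbor graph, and all parameter values are illustrative assumptions.

```python
import numpy as np

def laplacian_eigenmaps(X, n_components=2, n_neighbors=3):
    """Embed the rows of X with unsupervised Laplacian Eigenmaps.

    Assumed graph construction: a symmetric k-nearest-neighbor graph
    weighted by cosine similarity. The graph should be connected.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Cosine similarity between all pairs of rows.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    Xn = X / norms
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # never pick a point as its own neighbor

    # Symmetric kNN affinity matrix W.
    W = np.zeros((n, n))
    nearest = np.argsort(-S, axis=1)[:, :n_neighbors]
    for i in range(n):
        for j in nearest[i]:
            w = max(S[i, j], 0.0)
            W[i, j] = w
            W[j, i] = w

    # Solve L y = lambda D y through the symmetric normalized Laplacian
    # L_sym = I - D^{-1/2} W D^{-1/2}, then map back with y = D^{-1/2} u.
    deg = W.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order

    # Drop the first (trivial, constant) solution.
    return d_inv_sqrt[:, None] * vecs[:, 1:n_components + 1]

# Hypothetical word counts for six short notes over the vocabulary
# [murmur, chest, syncope, fever, exam]; rows 0-2 and rows 3-5 form
# two groups that share only the word "exam".
X = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [2, 2, 0, 0, 1],
    [0, 0, 2, 1, 1],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 2, 1],
])
Y = laplacian_eigenmaps(X, n_components=2, n_neighbors=3)
print(Y.round(3))
```

The rows of `Y` could then be appended to the structured predictors and passed to an existing classifier. In this toy example the first embedding coordinate takes opposite signs for the two groups of notes, which illustrates how the embedding preserves local similarity.
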
