Toward better public health reporting using existing off the shelf approaches: A comparison of alternative cancer detection approaches using plaintext medical data and non-dictionary based feature selection

OBJECTIVES Increased adoption of electronic health records has increased the availability of free-text clinical data for secondary use. A variety of approaches exist for obtaining actionable information from unstructured free-text data, but they are resource intensive, inherently complex, and rely on structured clinical data and dictionary-based approaches. We sought to evaluate the potential to obtain actionable information from free-text pathology reports using routinely available tools and approaches that do not depend on dictionaries.

MATERIALS AND METHODS We obtained pathology reports from a large health information exchange and evaluated the capacity to detect cancer cases from these reports using 3 non-dictionary feature selection approaches, 4 feature subset sizes, and 5 clinical decision models: simple logistic regression, naïve Bayes, k-nearest neighbor, random forest, and J48 decision tree. The performance of each decision model was evaluated using sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristic (ROC) curve.

RESULTS Decision models parameterized using automated, informed, and manual feature selection approaches yielded similar results. Furthermore, non-dictionary classification approaches identified cancer cases present in free-text reports with evaluation measures approaching or exceeding 80-90% for most metrics.

CONCLUSION Our methods are feasible and practical approaches for extracting substantial information value from free-text medical data, and the results suggest that these methods can perform on par with, if not better than, existing dictionary-based approaches. Given that public health agencies are often under-resourced and lack the technical capacity for more complex methodologies, these results represent potentially significant value to the public health field.
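
The five evaluation measures named in the methods can be illustrated with a minimal Python sketch. It assumes binary labels (1 = cancer case, 0 = non-case) and, for the ROC area, a numeric score per report. The function names `binary_metrics` and `roc_auc` are hypothetical and introduced only for illustration; the study's own tooling is not reproduced here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy, and PPV from binary labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "ppv": ratio(tp, tp + fp),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of case and non-case scores.

    Equals the probability that a randomly chosen case scores higher than a
    randomly chosen non-case, with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form of the ROC area is quadratic in the number of reports, which is acceptable for a sketch; a rank-based computation would be the usual choice for larger report sets.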
