PubMed-supported clinical term weighting approach for improving inter-patient similarity measure in diagnosis prediction

Background: Similarity-based retrieval of electronic health records (EHRs) from large clinical information systems gives physicians supporting evidence when making diagnoses or referring suspected cases for examination. Clinical terms in EHRs represent high-level conceptual information, and a similarity measure built on these terms reflects the chance of inter-patient disease co-occurrence. Assuming that all clinical terms are equally relevant to a disease is unrealistic and reduces prediction accuracy. Here we propose a term weighting approach, supported by the PubMed search engine, to address this issue.

Methods: We collected and studied 112 abdominal computed tomography imaging examination reports from four hospitals in Hong Kong. Clinical terms, namely the image findings related to hepatocellular carcinoma (HCC), were extracted from the reports. Using two systematic PubMed search methods, we established generic and specific term weightings by estimating the conditional probabilities of clinical terms given HCC. Each report was characterized by an ontological feature vector, giving a total of 6216 vector pairs. We optimized the modified direction cosine (mDC) with respect to a regularization constant embedded in the feature vector. Equal, generic and specific term weighting approaches were applied to measure the similarity of each pair, and their performance in predicting inter-patient co-occurrence of HCC diagnoses was compared using receiver operating characteristic (ROC) analysis.

Results: The areas under the ROC curves (AUROCs) of similarity scores based on equal, generic and specific term weighting were 0.735, 0.728 and 0.743, respectively (p < 0.01). Compared with equal term weighting, performance was significantly improved by specific term weighting (p < 0.01) but not by generic term weighting. The clinical terms “dysplastic nodule”, “nodule of liver” and “equal density (isodense) lesion” were found to be the top three image findings associated with HCC in PubMed.

Conclusions: Our findings suggest that the optimized similarity measure with specific term weighting applied to EHRs can significantly improve the accuracy of predicting inter-patient co-occurrence of diagnosis, compared with equal and generic term weighting approaches.
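The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact form of the mDC, so the sketch uses a plain weighted cosine with a constant `reg` appended to both vectors as a stand-in for the regularization constant. The function names, the hit counts, and the rule for labelling a pair as positive are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def term_weights(joint_hits, disease_hits):
    """Estimate P(term | disease) as hits(term AND disease) / hits(disease).

    The hit counts are assumed to come from literature searches; the
    numbers used below are hypothetical.
    """
    return {term: n / disease_hits for term, n in joint_hits.items()}


def weighted_cosine(a, b, weights, reg=0.0):
    """Cosine similarity of two term sets after term weighting.

    `reg` is appended as one extra constant component to both vectors.
    This is an assumed stand-in for the regularization constant, not
    the mDC defined in the paper.
    """
    terms = sorted(weights)
    va = [weights[t] * (t in a) for t in terms] + [reg]
    vb = [weights[t] * (t in b) for t in terms] + [reg]
    dot = sum(x * y for x, y in zip(va, vb))
    na = sqrt(sum(x * x for x in va))
    nb = sqrt(sum(y * y for y in vb))
    return dot / (na * nb) if na and nb else 0.0


def auroc(scores, labels):
    """Probability that a positive pair outscores a negative pair.

    Ties count as 0.5 (the Mann-Whitney formulation of the AUROC).
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 112 reports give 112 * 111 / 2 = 6216 unordered pairs.
n_pairs = len(list(combinations(range(112), 2)))

# Hypothetical hit counts, for illustration only.
weights = term_weights(
    {"dysplastic nodule": 300, "nodule of liver": 200, "ascites": 50},
    disease_hits=1000,
)

# Toy reports: (set of extracted terms, HCC diagnosis flag).
reports = [
    ({"dysplastic nodule", "nodule of liver"}, True),
    ({"dysplastic nodule"}, True),
    ({"ascites"}, False),
    ({"nodule of liver", "ascites"}, False),
]

scores, labels = [], []
for (terms_a, hcc_a), (terms_b, hcc_b) in combinations(reports, 2):
    scores.append(weighted_cosine(terms_a, terms_b, weights, reg=0.05))
    # Assumption: a pair is positive when both patients have HCC.
    labels.append(hcc_a and hcc_b)

print(n_pairs, round(auroc(scores, labels), 3))
```

Replacing `weights` with a dictionary of ones reproduces the equal-weighting baseline, so the two AUROCs can be compared on the same pairs.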
