Taming EHR data: Using Semantic Similarity to reduce Dimensionality

Medical care data is a valuable resource that can be used for many purposes, including managing and planning for future health needs and clinical research. However, the heterogeneity and complexity of medical data can be an obstacle to applying data mining techniques, so much of the potential value of this data goes untapped. In this paper we develop a methodology that reduces the dimensionality of primary care data in order to make it more amenable to visualisation, mining and clustering. The methodology combines ontology-based semantic similarity with principal component analysis (PCA) to map the data into an appropriate and informative low-dimensional space. Throughout the study, we had access to anonymised patient data from primary care in Salford, UK. Our results show that diagnosis codes in primary care data can be used to map patients into an informative low-dimensional space, which in turn supports further data exploration and medical hypothesis formulation.
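To make the pipeline concrete, the following is a minimal sketch of the general idea and not the implementation used in the study. It assumes a tiny hypothetical diagnosis hierarchy (the code names are invented and are not real clinical codes), uses Wu-Palmer similarity as a stand-in ontology-based measure, and compares patients by a symmetric best-match average; all of these choices, and every function name, are illustrative assumptions. PCA is computed with a plain SVD of the column-centred patient similarity matrix.

```python
import numpy as np

# Hypothetical toy hierarchy (child -> parent); NOT real clinical codes.
PARENT = {
    "endocrine": "root",
    "circulatory": "root",
    "diabetes_t1": "endocrine",
    "diabetes_t2": "endocrine",
    "hypertension": "circulatory",
    "angina": "circulatory",
}

# Hypothetical patients, each described by a set of diagnosis codes.
PATIENTS = {
    "p1": {"diabetes_t1"},
    "p2": {"diabetes_t2", "hypertension"},
    "p3": {"angina"},
    "p4": {"hypertension", "angina"},
}


def ancestors(code):
    """Return the path from `code` up to the root, starting with `code`."""
    path = [code]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(code):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(ancestors(code))


def code_similarity(a, b):
    """Wu-Palmer similarity (assumed measure): 2*depth(lcs) / (depth(a)+depth(b))."""
    anc_b = set(ancestors(b))
    # First ancestor of `a` that is also an ancestor of `b` is the deepest shared one.
    lcs = next(c for c in ancestors(a) if c in anc_b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def patient_similarity(p, q):
    """Symmetric best-match average between two sets of codes (assumed aggregation)."""
    best_pq = [max(code_similarity(a, b) for b in q) for a in p]
    best_qp = [max(code_similarity(a, b) for a in p) for b in q]
    return (sum(best_pq) + sum(best_qp)) / (len(p) + len(q))


def embed(patients, n_components=2):
    """Project patients into a low-dimensional space via PCA of their similarity profiles."""
    names = sorted(patients)
    sim = np.array(
        [[patient_similarity(patients[i], patients[j]) for j in names] for i in names]
    )
    centred = sim - sim.mean(axis=0)  # centre each column before PCA
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return names, u[:, :n_components] * s[:n_components]  # principal component scores


if __name__ == "__main__":
    names, coords = embed(PATIENTS)
    for name, xy in zip(names, coords):
        print(name, np.round(xy, 3))
```

Each patient is represented by a row of similarities to every other patient, so patients with semantically close diagnoses receive similar rows and land near each other in the projected space. With real data, the toy hierarchy would be replaced by the clinical coding hierarchy in use, and the similarity measure by whichever ontology-based measure is chosen.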
