Analysis of Primary Care Computerized Medical Records (CMR) Data With Deep Autoencoders (DAE)

Deep learning is becoming increasingly important in the analysis of medical data, for applications such as pattern recognition for classification. Primary healthcare computerised medical records (CMR) data are vital for predicting infection prevalence across a population and for decision making at a national scale. However, to date, machine learning algorithms remain under-utilised on CMR data, despite their potential impact in diagnostics or in the prevention of epidemics such as outbreaks of influenza. A particular challenge in epidemiology is how to differentiate incident cases from follow-ups for the same condition. Furthermore, CMR data are typically heterogeneous, noisy, high dimensional and incomplete, making automated analysis difficult. Here we introduce a methodology for converting heterogeneous data into a form compatible with a deep autoencoder, which is then used for dimensionality reduction of CMR data. This approach provides a tool for real-time visualisation of these high-dimensional data, revealing previously unknown dependencies and clusters. Our unsupervised nonlinear reduction method can be used to identify the features driving the formation of these clusters, which can aid decision making in healthcare applications. The results in this work demonstrate that our methods can cluster more than 97.84\% of the data (clusters of $>$5 points), with each cluster uniquely described by three attributes in the data: Clinical System (CMR system), Read Code (as recorded) and Read Term (standardised coding). Further, we propose the use of Shannon entropy as a means to analyse the dispersion of clusters and the contribution from the underlying attributes, to gain further insight from the data. Our results demonstrate that Shannon entropy is a useful metric for analysing both the low-dimensional clusters of CMR data and the features in the original heterogeneous data. Finally, we find that the entropy of the low-dimensional clusters is directly representative of the entropy of the input data (Pearson correlation = 0.99, R$^2$ = 0.98), and therefore the reduced data from the deep autoencoder are reflective of the original CMR data variability.
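
The entropy analysis described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the placeholder code labels ("A" to "D") and the helper names `shannon_entropy` and `pearson` are all assumptions made for illustration. The sketch computes the Shannon entropy, in bits, of one categorical attribute within each cluster, and provides a plain Pearson correlation of the kind that could be used to compare cluster entropies with input-data entropies.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # H = sum_i p_i * log2(1 / p_i), with p_i the relative frequency of label i
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy records: (cluster id, placeholder code label).
# These are made-up values, not real Read Codes or real CMR data.
records = [
    (0, "A"), (0, "A"), (0, "A"), (0, "A"),
    (1, "A"), (1, "B"), (1, "A"), (1, "B"),
    (2, "A"), (2, "B"), (2, "C"), (2, "D"),
]

# Group the attribute values by cluster.
by_cluster = {}
for cluster_id, code in records:
    by_cluster.setdefault(cluster_id, []).append(code)

# Entropy of the attribute within each cluster: 0 bits when a cluster holds a
# single code, rising as the codes become more evenly mixed.
entropies = {cid: shannon_entropy(codes) for cid, codes in by_cluster.items()}

for cid in sorted(entropies):
    print(f"cluster {cid}: {entropies[cid]:.2f} bits")
```

In this toy case a cluster containing a single code has zero entropy, while a cluster spread evenly over four codes has two bits, which is the sense in which entropy measures dispersion. As a consistency check on the reported figures, if R$^2$ is taken from a simple linear fit it equals the square of the Pearson correlation, and $0.99^2 \approx 0.98$.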
