From Text to Topics in Healthcare Records: An Unsupervised Graph Partitioning Methodology

Electronic Healthcare Records contain large volumes of unstructured data, including extensive free text. Yet this source of detailed information often remains under-used because of a lack of methodologies to extract interpretable content in a timely manner. Here we apply network-theoretical tools to analyse free text in Hospital Patient Incident reports from the National Health Service, to find clusters of documents with similar content in an unsupervised manner at different levels of resolution. We combine deep neural network paragraph vector text-embedding with multiscale Markov Stability community detection applied to a sparsified similarity graph of document vectors, and showcase the approach on incident reports from Imperial College Healthcare NHS Trust, London. The multiscale community structure reveals different levels of meaning in the topics of the dataset, as shown by descriptive terms extracted from the clusters of records. We also compare a posteriori against hand-coded categories assigned by healthcare personnel, and show that our approach outperforms LDA-based models. Our content clusters exhibit good correspondence with two levels of hand-coded categories, yet they also provide further medical detail in certain areas and reveal complementary descriptors of incidents beyond the external classification taxonomy.
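As a rough illustration of the graph-based part of this pipeline, the sketch below builds a sparsified similarity graph from document vectors and scores a given partition with a discrete-time form of Markov Stability. It is a minimal sketch under stated assumptions, not the authors' implementation. The document vectors are synthetic stand-ins for paragraph-vector embeddings. The k-nearest-neighbour sparsification, the cosine similarity, the integer Markov time `t`, and the function names `knn_similarity_graph` and `markov_stability` are all illustrative choices not taken from the abstract. The search over partitions that a community detection method would perform is not shown.

```python
import numpy as np


def knn_similarity_graph(X, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity.

    Assumption: kNN sparsification is one plausible way to sparsify the
    similarity graph; the abstract does not specify the scheme used.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = np.clip(Xn @ Xn.T, 0.0, None)   # keep non-negative weights only
    np.fill_diagonal(S, 0.0)            # no self-loops
    n = S.shape[0]
    nearest = np.argsort(-S, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    A = np.zeros_like(S)
    A[rows, nearest.ravel()] = S[rows, nearest.ravel()]
    return np.maximum(A, A.T)           # symmetrise: keep an edge if either end chose it


def markov_stability(A, labels, t):
    """Discrete-time Markov Stability of a partition at integer Markov time t.

    Computes trace(H^T (Pi M^t - pi pi^T) H), where M is the random-walk
    transition matrix, pi its stationary distribution, and H the
    community indicator matrix.
    """
    d = A.sum(axis=1)
    pi = d / d.sum()
    M = A / d[:, None]
    H = np.eye(labels.max() + 1)[labels]
    Mt = np.linalg.matrix_power(M, t)
    R = H.T @ (np.diag(pi) @ Mt - np.outer(pi, pi)) @ H
    return float(np.trace(R))


# Synthetic "document vectors": two groups of 20 documents in 8 dimensions
# (hypothetical data standing in for learned paragraph vectors).
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1], 20)
centers = np.zeros((2, 8))
centers[0, 0] = 1.0
centers[1, 1] = 1.0
X = centers[true_labels] + 0.1 * rng.normal(size=(40, 8))

A = knn_similarity_graph(X, k=5)

# A partition that cuts across both groups, for comparison.
alt_labels = np.arange(40) % 2

for t in (1, 3):
    print(t, markov_stability(A, true_labels, t), markov_stability(A, alt_labels, t))
```

In this toy setting the partition that matches the two planted groups should score higher than the one that cuts across them. Scanning the Markov time while optimising over partitions is what would give clusterings at different levels of resolution.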
