Discovering Sublanguages in a Large Clinical Corpus through Unsupervised Machine Learning and Information Gain

Sublanguages are domain-centered subsets of general or colloquial language. Identifying them supports several language analysis tasks, but separate sublanguages are difficult to discern in large clinical corpora. We applied k-means clustering to semantic properties, together with a novel implementation of relative entropy as an information gain indicator, to identify sublanguages within a large clinical corpus (~1.6 million documents), and visualized the results in a heat map. Patterns both within and across clusters reveal sublanguage trends. These findings are significant for sublanguage analysis and have implications at both regional and international levels.
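The pipeline described in the abstract (cluster documents by semantic properties, then score each cluster with relative entropy) can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not detail. The sketch assumes that each document is represented by counts over a few semantic groups, that relative entropy is taken between a cluster's pooled distribution and the corpus-wide distribution, and that the per-group terms fill one row of the heat map. The group names, the toy counts, and the function names are all hypothetical.

```python
import math

# Hypothetical semantic groups and toy per-document counts (not from the paper).
SEMANTIC_GROUPS = ["Disorders", "Procedures", "Chemicals & Drugs", "Anatomy"]
DOCS = [
    [8, 1, 1, 2], [7, 2, 0, 3], [9, 1, 1, 1],   # notes dominated by disorders
    [1, 2, 9, 0], [0, 1, 8, 1], [2, 1, 7, 0],   # notes dominated by drugs
]

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kmeans(points, k, iters=20):
    """Plain k-means with evenly spaced initial centers (a simplification
    chosen to keep the sketch deterministic)."""
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def relative_entropy_terms(p, q):
    """Per-feature terms p_i * log2(p_i / q_i); their sum is D(P || Q) in bits.
    Assumes q_i > 0 wherever p_i > 0, which holds when Q is the whole corpus."""
    return [pi * math.log2(pi / qi) if pi > 0 else 0.0 for pi, qi in zip(p, q)]

# Cluster documents on their semantic-group proportions.
labels = kmeans([normalize(d) for d in DOCS], k=2)

# Corpus-wide distribution over semantic groups (the reference Q).
corpus = normalize([sum(col) for col in zip(*DOCS)])

# One heat-map row per cluster: how each group diverges from the corpus.
heat_map = {}
for c in sorted(set(labels)):
    cluster_docs = [d for d, lab in zip(DOCS, labels) if lab == c]
    pooled = [sum(col) for col in zip(*cluster_docs)]
    heat_map[c] = relative_entropy_terms(normalize(pooled), corpus)

for c, row in heat_map.items():
    print(c, dict(zip(SEMANTIC_GROUPS, (round(v, 3) for v in row))))
```

In this reading, a large positive term marks a semantic group that is over-represented in a cluster relative to the corpus, and a negative term marks one that is under-represented. Patterns of such terms across rows are the kind of signal a heat map would make visible.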
