Paper Information - Discovering Sublanguages in a Large Clinical Corpus through Unsupervised Machine Learning and Information Gain

Sublanguages are domain-centered subsets of general or colloquial language. Identifying them supports several language analysis tasks, but separate sublanguages are difficult to discern in large clinical corpora. We applied k-means clustering of semantic properties, together with a novel implementation of relative entropy as an information gain indicator, to identify sublanguages within a large clinical corpus (~1.6 million documents), and visualized the results in a heat map. Patterns both within and across clusters reveal sublanguage trends. These findings are significant for sublanguage analysis and have implications at both regional and international levels.
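The general idea of combining clustering with relative entropy can be illustrated with a minimal sketch. This is not the authors' implementation: the document counts, the three semantic categories, the deterministic centroid seeding, and the function names are all hypothetical, and the divergence shown is the standard Kullback-Leibler form rather than the novel variant the abstract mentions. Each document is represented by its distribution over semantic categories, documents are grouped with plain k-means, and each cluster's distribution is then compared against the corpus-wide distribution.

```python
import math

def normalize(counts):
    """Turn raw category counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def relative_entropy(p, q, eps=1e-12):
    """Standard Kullback-Leibler divergence D(p || q), in bits."""
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means. Seeding with the first k points is a
    simplification chosen to keep this sketch deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy data: each row holds one document's counts for
# three made-up semantic categories.
docs = [
    [9, 1, 0],
    [0, 2, 8],
    [8, 2, 0],
    [1, 1, 8],
]

points = [normalize(d) for d in docs]
labels = kmeans(points, k=2)

# Corpus-wide category distribution, used as the reference.
corpus = normalize([sum(col) for col in zip(*docs)])

# Relative entropy of each cluster's distribution against the corpus.
gains = {}
for j in sorted(set(labels)):
    member_counts = [d for d, lab in zip(docs, labels) if lab == j]
    cluster = normalize([sum(col) for col in zip(*member_counts)])
    gains[j] = relative_entropy(cluster, corpus)
    print(f"cluster {j}: relative entropy vs corpus = {gains[j]:.3f} bits")
```

A cluster whose category distribution differs sharply from the corpus as a whole yields a larger value, which is one way such a score could flag a candidate sublanguage. A matrix of these scores (clusters by categories) is the kind of data a heat map would display.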
