Lexical Characteristics Analysis of Chinese Clinical Documents

Understanding the lexical characteristics of clinical documents is the foundation of the sublanguage-based Medical Language Processing (MLP) approach. However, few studies have focused on the lexical characteristics of Chinese clinical documents. In this study, a lexical characteristics analysis at both the syntactic and semantic levels was conducted on a clinical corpus containing 3,500 clinical documents generated during daily practice. The analysis was based on the automatic tagging results of a lexicon-based part-of-speech (POS) and semantic tagging method. The medical lexicon contains 237,291 entries annotated with both semantic and syntactic classes. The normalized frequencies of terms, syntactic classes, and semantic classes were calculated and visualized. The major contributions of this paper are a wide-coverage Chinese medical semantic lexicon and a description of the lexical characteristics of Chinese clinical documents. Both will lay a good foundation for sublanguage-based MLP studies in China.
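To make the pipeline summarized above more concrete, the following is a minimal sketch of lexicon-based tagging followed by a normalized frequency count. It is illustrative only and is not the method used in the study: the abstract does not say how lexicon entries are matched against text or how frequencies are normalized, so the sketch assumes forward maximum matching and a rate per 1,000 tokens. The toy lexicon, the class labels, and the function names `tag` and `normalized_frequency` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon: term -> (POS class, semantic class).
# The lexicon described in the abstract has 237,291 entries; these five are invented.
LEXICON = {
    "患者": ("n", "Person"),
    "发热": ("v", "Symptom"),
    "咳嗽": ("v", "Symptom"),
    "阿莫西林": ("n", "Drug"),
    "服用": ("v", "Action"),
}
MAX_LEN = max(len(term) for term in LEXICON)


def tag(text):
    """Tag text by forward maximum matching (an assumed matching strategy).

    At each position the longest lexicon entry that matches is taken;
    characters not covered by any entry are emitted as unknown tokens.
    Returns a list of (term, pos_class, semantic_class) tuples.
    """
    tokens = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in LEXICON:
                pos, sem = LEXICON[piece]
                tokens.append((piece, pos, sem))
                i += n
                break
        else:
            # No lexicon entry starts here: emit a single unknown character.
            tokens.append((text[i], "unk", "Unknown"))
            i += 1
    return tokens


def normalized_frequency(docs, field, per=1000):
    """Frequency of each value of one token field, per `per` tokens.

    field: 0 = term, 1 = POS class, 2 = semantic class.
    The per-1,000-token normalization is an assumption of this sketch.
    """
    counts = Counter()
    total = 0
    for doc in docs:
        for token in tag(doc):
            counts[token[field]] += 1
            total += 1
    if total == 0:
        return {}
    return {key: per * value / total for key, value in counts.items()}


docs = ["患者发热咳嗽", "患者服用阿莫西林"]
semantic_freq = normalized_frequency(docs, field=2)
pos_freq = normalized_frequency(docs, field=1)
```

Here `semantic_freq` and `pos_freq` map each class label to its rate per 1,000 tokens over the two toy sentences. Other normalizations, such as per-document rates, would fit the abstract's description equally well.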
