Common data model for natural language processing based on two existing standard information models: CDA+GrAF

A growing need for collaboration and resource sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated in and extracted from clinical text. To develop such a data model, which we call "CDA+GrAF", we combined two existing standards: the HL7 Clinical Document Architecture (CDA) and the ISO Graph Annotation Format (GrAF; in development). We experimented with several methods of combining these standards and eventually selected one that wraps separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. We used two use cases, clinical document sections and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), to create examples of such standoff annotation documents, which validated successfully against the XML schemata provided with both standards. We also developed a tool that automatically translates annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF, and used it to generate 50 annotation documents, all of which validated successfully. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and we plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may make it easier to compare NLP tools and applications directly, combine their output, transform and "translate" annotations between different NLP applications, and eventually "plug and play" different modules in NLP applications.
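To illustrate the idea of translating 2010 i2b2/VA concept annotations into standoff annotation, the following is a minimal sketch in Python and not the tool described above. It assumes the challenge's concept line layout (`c="text" line:token line:token||t="type"`, with lines numbered from 1 and whitespace-separated tokens from 0) and emits a simplified, GrAF-like XML tree. The function names `token_spans` and `i2b2_to_standoff` are hypothetical, and the element and attribute names (`graph`, `region`, `node`, `link`, `a`, `anchors`) are illustrative only; the output is not validated against the GrAF or CDA schemata.

```python
import re
import xml.etree.ElementTree as ET

# One 2010 i2b2/VA concept line, e.g.  c="chest pain" 1:2 1:3||t="problem"
CONCEPT_RE = re.compile(r'^c="(.*)" (\d+):(\d+) (\d+):(\d+)\|\|t="(.*)"$')


def token_spans(text):
    """Map (line number, token index) to (start, end) character offsets.

    Lines are numbered from 1 and whitespace-separated tokens from 0
    (assumed i2b2 convention).
    """
    spans = {}
    offset = 0
    for line_no, line in enumerate(text.split("\n"), start=1):
        for tok_no, match in enumerate(re.finditer(r"\S+", line)):
            spans[(line_no, tok_no)] = (offset + match.start(),
                                        offset + match.end())
        offset += len(line) + 1  # +1 for the newline removed by split
    return spans


def i2b2_to_standoff(text, concept_lines):
    """Build a simplified, GrAF-like standoff tree (illustrative names only)."""
    spans = token_spans(text)
    graph = ET.Element("graph")
    for i, raw in enumerate(concept_lines, start=1):
        m = CONCEPT_RE.match(raw.strip())
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        _, l1, t1, l2, t2, concept_type = m.groups()
        start = spans[(int(l1), int(t1))][0]
        end = spans[(int(l2), int(t2))][1]
        # The region points into the untouched source text by offsets.
        ET.SubElement(graph, "region",
                      {"id": f"r{i}", "anchors": f"{start} {end}"})
        # The node links to the region; the annotation labels the node.
        node = ET.SubElement(graph, "node", {"id": f"n{i}"})
        ET.SubElement(node, "link", {"targets": f"r{i}"})
        ET.SubElement(graph, "a", {"label": concept_type, "ref": f"n{i}"})
    return graph


if __name__ == "__main__":
    note = "Patient reports chest pain .\nECG was normal ."
    concepts = ['c="chest pain" 1:2 1:3||t="problem"',
                'c="ecg" 2:0 2:0||t="test"']
    print(ET.tostring(i2b2_to_standoff(note, concepts), encoding="unicode"))
```

Because the annotations refer to the text only through character offsets, the clinical note itself is never modified, which is the property that makes the annotation "standoff".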
