Annotating a Large Representative Corpus of Clinical Notes for Parts of Speech

[We report the procedures for developing a large, representative corpus of 50,000 sentences taken from clinical notes. Previously reported annotated corpora of clinical notes have been small and do not represent the whole domain of clinical notes. The sentences in this corpus were selected from a very large raw corpus of ten thousand documents, which were in turn sampled from an internal repository of more than 700,000 documents from multiple health care providers. Each document was de-identified to remove any protected health information (PHI). Using the Penn Treebank tagging guidelines with minor modifications, we annotate this corpus manually, with an average inter-annotator agreement of more than 98%. The goal is to create a part-of-speech annotated corpus for the clinical domain that is comparable to the Penn Treebank and that represents the full range of contemporary text used in the clinical domain. We also report the output of the TnT tagger trained on the initial 21,000 annotated sentences, which reaches a preliminary accuracy above 96%.]
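The abstract does not state how inter-annotator agreement was computed. As a minimal sketch, assuming simple token-level percent agreement between two annotators, the measure could look like the following. The function name `percent_agreement` and the sample tag sequences are hypothetical and are not taken from the corpus.

```python
from typing import Sequence


def percent_agreement(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Fraction of tokens on which two annotators assigned the same tag.

    Assumption: both annotators tagged the same tokenization, so the
    two sequences are aligned position by position.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    if not tags_a:
        raise ValueError("cannot compute agreement on an empty sequence")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical Penn Treebank style tags for one five-token sentence.
annotator_1 = ["NN", "VBD", "DT", "NN", "."]
annotator_2 = ["NN", "VBN", "DT", "NN", "."]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.2%}")
```

Raw percent agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa would be an alternative if that is what the reported figure reflects.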
