Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing

Background: A high-quality gold standard is vital for supervised, machine learning-based clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based on very small pilot sample sizes. In addition, the quality of crowdsourced biomedical NLP corpora has never been exceptional when compared with traditionally developed gold standards. Previously reported results on a medical named entity annotation task showed an F-measure-based agreement of 0.68 between crowdsourced and traditionally developed corpora.

Objective: Building on previous work in general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain, with special emphasis on achieving high agreement between crowdsourced and traditionally developed corpora.

Methods: To build the gold standard for evaluating the crowdsourcing workers' performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd's work and tested for statistically significant differences (P<.001, chi-square test) between the crowdsourced and traditionally developed annotations.

Results: Agreement between the crowd's annotations and the traditionally generated corpora was high for (1) annotation (F-measure 0.87 for medication names; 0.73 for medication types) and (2) correction of previous annotations (0.90 for medication names; 0.76 for medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowd-generated and traditionally generated corpora. Our results represent a 27.9% improvement over previously reported results on the medication named entity annotation task.

Conclusions: This study offers three contributions. First, we demonstrated that crowdsourcing is a feasible, inexpensive, fast, and practical approach to collecting high-quality annotations for clinical text (when protected health information is excluded). We believe that well-designed user interfaces and a rigorous quality control strategy for entity annotation and linking were critical to the success of this work. Second, as a further contribution to the Internet-based crowdsourcing field, we will publicly release the JavaScript and CrowdFlower Markup Language infrastructure code necessary to use CrowdFlower's quality control and crowdsourcing interfaces for named entity annotations. Finally, to spur future research, we will release the CTA annotations generated by the traditional and crowdsourced approaches.
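
The Methods and Results above describe two computational steps: aggregating multiple worker judgments per annotation unit by simple (majority) voting, and scoring the aggregated annotations against the traditionally developed gold standard with sensitivity, precision, and F-measure. The following minimal Python sketch illustrates those two steps under stated assumptions; the data structures, function names, and example values are hypothetical and are not the authors' released JavaScript/CrowdFlower Markup Language infrastructure.

    from collections import Counter

    def aggregate_by_majority(judgments):
        """Simple-voting aggregation: for each annotation unit, keep the label
        chosen by the largest number of workers (ties resolved arbitrarily)."""
        return {unit: Counter(labels).most_common(1)[0][0]
                for unit, labels in judgments.items()}

    def precision_sensitivity_f1(predicted, gold):
        """Score a set of (document_id, span, label) annotations against a
        gold-standard set; sensitivity is recall, and the F-measure is the
        harmonic mean of precision and sensitivity."""
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        sensitivity = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * sensitivity / (precision + sensitivity)
              if precision + sensitivity else 0.0)
        return precision, sensitivity, f1

    # Hypothetical example: three workers label one mention in a CTA.
    judgments = {("CTA-001", (10, 19)): ["medication_name",
                                         "medication_name",
                                         "medication_type"]}
    crowd = {(doc, span, label)
             for (doc, span), label in aggregate_by_majority(judgments).items()}
    gold = {("CTA-001", (10, 19), "medication_name")}
    print(precision_sensitivity_f1(crowd, gold))  # -> (1.0, 1.0, 1.0)

Because the paper attributes its results partly to a rigorous quality control strategy, a production pipeline would presumably filter out unreliable workers before this aggregation step rather than voting over all submitted judgments.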
