Development of a Pediatric Text-Corpus for Part-of-Speech Tagging

Most efforts in natural language processing (NLP) have been devoted to understanding general-domain data. Specialized domains, such as pediatric medicine, pose unique problems and challenges. While many general-purpose corpora and lexicons have been created, we know of none directly related to pediatric medicine. This article presents the status of an ongoing project to create a large corpus and lexicon for use by part-of-speech taggers and other NLP research tools, aimed at developing new methods in sciences related to medical domains. Experiments with automatic tagging indicate that attainable accuracy on this type of text is limited to 92–93%.