Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows

Background
Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often “hidden” within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients.

Methods
A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents.

Results
When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors.

Conclusions
We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.
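
For readers unfamiliar with the evaluation measure reported above, the following is a minimal sketch of how a micro-averaged F-score with relaxed matching could be computed. It is not the evaluation code used in this work. The span representation, the function names (`overlaps`, `micro_f_score`) and the definition of a relaxed match as "same concept type and at least one shared character" are all assumptions made for illustration; the abstract does not specify the exact matching criterion.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation representation: (start offset, end offset, concept type).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Assumed relaxed match: same concept type and overlapping character ranges."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def micro_f_score(gold: Dict[str, List[Span]],
                  predicted: Dict[str, List[Span]]) -> float:
    """Pool counts over all documents and types, then compute a single F-score."""
    matched_pred = matched_gold = n_pred = n_gold = 0
    for doc_id in set(gold) | set(predicted):
        g = gold.get(doc_id, [])
        p = predicted.get(doc_id, [])
        n_gold += len(g)
        n_pred += len(p)
        # Predicted spans that overlap some gold span count towards precision.
        matched_pred += sum(any(overlaps(x, y) for y in g) for x in p)
        # Gold spans that are overlapped by some prediction count towards recall.
        matched_gold += sum(any(overlaps(y, x) for x in p) for y in g)
    precision = matched_pred / n_pred if n_pred else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    gold = {"doc1": [(0, 5, "Problem"), (10, 20, "Treatment")]}
    predicted = {"doc1": [(3, 8, "Problem"), (30, 35, "Treatment")]}
    print(f"micro-averaged F-score (relaxed): {micro_f_score(gold, predicted):.2%}")
```

Micro-averaging pools the match counts across all documents and concept types before computing precision and recall, so frequent types weigh more heavily than rare ones. Under exact matching, `overlaps` would instead require identical start and end offsets.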

[57]  Hilmar Lapp,et al.  Evolutionary Characters, Phenotypes and Ontologies: Curating Data from the Systematic Biology Literature , 2010, PloS one.