Biomedical and Clinical English Model Packages in the Stanza Python NLP Library

We introduce biomedical and clinical English model packages for the Stanza Python NLP library. These packages offer accurate syntactic analysis and named entity recognition capabilities for biomedical and clinical text, by combining Stanza's fully neural architecture with a wide variety of open datasets as well as large-scale unsupervised biomedical and clinical text data. We show via extensive experiments that our packages achieve syntactic analysis and named entity recognition performance that is on par with or surpasses state-of-the-art results. We further show that these models do not compromise speed compared to existing toolkits when GPU acceleration is available, and are made easy to download and use with Stanza's Python interface. A demonstration of our packages is available at: http://stanza.run/bio.

[1]  Sampo Pyysalo,et al.  Neural Dependency Parsing of Biomedical Text: TurkuNLP entry in the CRAFT Structural Annotation Task , 2019, EMNLP.

[2]  Daniel King,et al.  ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing , 2019, BioNLP@ACL.

[3]  Scott T. Weiss,et al.  Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system , 2006, BMC Medical Informatics Decis. Mak..

[4]  Tapio Salakoski,et al.  Distributional Semantics Resources for Biomedical Text Processing , 2013 .

[5]  K. Bretonnel Cohen,et al.  A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools , 2012, BMC Bioinformatics.

[6]  Qingyu Chen,et al.  BioWordVec, improving biomedical word embeddings with subword information and MeSH , 2019, Scientific Data.

[7]  Peter Szolovits,et al.  MIMIC-III, a freely accessible critical care database , 2016, Scientific Data.

[8]  Shuying Shen,et al.  2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text , 2011, J. Am. Medical Informatics Assoc..

[9]  William W. Cohen,et al.  PubMedQA: A Dataset for Biomedical Research Question Answering , 2019, EMNLP.

[10]  Slav Petrov,et al.  Globally Normalized Transition-Based Neural Networks , 2016, ACL.

[11]  Sunghwan Sohn,et al.  Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications , 2010, J. Am. Medical Informatics Assoc..

[12]  Martin Potthast,et al.  CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies , 2018, CoNLL.

[13]  Nan Du,et al.  Extracting Symptoms and their Status from Clinical Conversations , 2019, ACL.

[14]  Hong Yu,et al.  AskHERMES: An online question answering system for complex clinical questions , 2011, J. Biomed. Informatics.

[15]  Ariel Deardorff,et al.  Why do biomedical researchers learn to program? An exploratory investigation , 2020, Journal of the Medical Library Association : JMLA.

[16]  Zhiyong Lu,et al.  BioCreative V CDR task corpus: a resource for chemical disease relation extraction , 2016, Database J. Biol. Databases Curation.

[17]  Lawrence Hunter,et al.  CRAFT Shared Tasks 2019 Overview — Integrated Structure, Semantics, and Coreference , 2019, EMNLP.

[18]  Wei-Hung Weng,et al.  Publicly Available Clinical BERT Embeddings , 2019, Proceedings of the 2nd Clinical Natural Language Processing Workshop.

[19]  Sampo Pyysalo,et al.  Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013 , 2015, BMC Bioinformatics.

[20]  Sampo Pyysalo,et al.  Anatomical entity mention recognition at literature scale , 2013, Bioinform..

[21]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[22]  Sowmya R. Rao,et al.  Use of electronic health records in U.S. hospitals. , 2009, The New England journal of medicine.

[23]  Christopher D. Manning,et al.  Stanza: A Python Natural Language Processing Toolkit for Many Human Languages , 2020, ACL.

[24]  Karin M. Verspoor,et al.  From POS tagging to dependency parsing for biomedical event extraction , 2018, BMC Bioinformatics.

[25]  Alan R. Aronson,et al.  An overview of MetaMap: historical perspective and recent advances , 2010, J. Am. Medical Informatics Assoc..

[26]  K. Cohen,et al.  Biomedical language processing: what's beyond PubMed? , 2006, Molecular cell.

[27]  Iz Beltagy,et al.  SciBERT: A Pretrained Language Model for Scientific Text , 2019, EMNLP.

[28]  Christopher D. Manning,et al.  Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks , 2016, LREC.

[29]  Hoifung Poon,et al.  Literome: PubMed-scale genomic knowledge base in the cloud , 2014, Bioinform..

[30]  Goran Nenadic,et al.  LINNAEUS: A species name identification system for biomedical literature , 2010, BMC Bioinformatics.

[31]  Roland Vollgraf,et al.  Contextual String Embeddings for Sequence Labeling , 2018, COLING.

[32]  L. Jensen,et al.  The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text , 2013, PloS one.

[33]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[34]  Timothy Dozat,et al.  Deep Biaffine Attention for Neural Dependency Parsing , 2016, ICLR.

[35]  Eugene Charniak,et al.  Self-Training for Biomedical Parsing , 2008, ACL.

[36]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[37]  Sampo Pyysalo,et al.  Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection , 2020, LREC.

[38]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[39]  Zhiyong Lu,et al.  NCBI disease corpus: A resource for disease name recognition and concept normalization , 2014, J. Biomed. Informatics.

[40]  Samuel R. Bowman,et al.  A Gold Standard Dependency Corpus for English , 2014, LREC.

[41]  Wiebke Wagner,et al.  Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit , 2010, Lang. Resour. Evaluation.

[42]  Yu Zhang,et al.  Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning , 2018, bioRxiv.

[43]  Jaewoo Kang,et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..

[44]  Zhiyong Lu,et al.  The CHEMDNER corpus of chemicals and drugs and its annotation principles , 2015, Journal of Cheminformatics.

[45]  Hongfang Liu,et al.  CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines , 2017, J. Am. Medical Informatics Assoc..

[46]  Saeed Hassanpour,et al.  Artificial Intelligence in Medicine , 2015 .

[47]  Timothy Dozat,et al.  Universal Dependency Parsing from Scratch , 2019, CoNLL.