PENNER: Pattern-enhanced Nested Named Entity Recognition in Biomedical Literature

Many biomedical entity mentions contain other entity mentions nested inside them. Most current named entity recognition (NER) systems handle only flat entities and ignore such nested entities, which may introduce errors into downstream tasks such as relation extraction and knowledge base completion. Recently, fully supervised methods have been proposed for nested named entity recognition. Despite their success on benchmark datasets, supervised methods rely on human annotation and lead to highly specialized systems that cannot be easily adapted to new entity types. In this study, we propose PENNER, a novel and effective pattern-enhanced nested named entity recognition method that relies on massive corpora and only very weak supervision. We compare PENNER with a state-of-the-art BioNER system, PubTator, and observe substantial improvement in recognizing genes, chemicals, diseases, and species. PENNER can also accurately extract new types of entities, such as biological processes and treatments, that are not annotated by PubTator.
