Context-aware Named Entity Recognition and Relation Extraction with a Domain-specific Language Model

ChEMU 2022 tasks 1a and 1b are NER (Named Entity Recognition) and EE (Event Extraction) benchmarks, where EE amounts to RE (Relation Extraction) between trigger words and entities. We develop context-aware NER and RE models based on a domain-specific language model and achieve state-of-the-art performance in ChEMU 2022, with public exact-match F1 scores of 96.33 on task 1a and 92.82 on task 1b. For the domain-specific language model, we post-train the BioLinkBERT model on various corpora and select the best-performing model on domain-specific benchmark datasets consisting of BLURB (Biomedical Language Understanding & Reasoning Benchmark) and ChEMU 2020. For the NER model, we choose a sequence tagging model, which outperforms a span-based model on ChEMU 2022 task 1a. For the RE model, we train the model to classify the relation type, or no relation, for every pair of a trigger word and an entity in the snippet. Furthermore, we train both models on inputs that contain multiple sentences rather than a single sentence so that the models can exploit contextual information. For the ensemble, we train the best-performing model with 10-fold cross-validation and combine the fold predictions with soft voting. Finally, we apply rule-based post-processing to the predictions.
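To make the RE formulation above concrete, here is a minimal sketch (not the authors' released code) of how every (trigger word, entity) pair in a multi-sentence snippet could be turned into a classification instance labelled with its relation type or a no-relation class. The field names ("id", "start", "end"), the input structure, and the NO_RELATION label are illustrative assumptions.

```python
from itertools import product

NO_RELATION = "NO_RELATION"  # assumed name for the "no relation" class

def build_re_instances(snippet_text, triggers, entities, gold_relations=()):
    """Pair every trigger word with every entity in the snippet and label
    each pair with its gold relation type, or NO_RELATION if unrelated.

    triggers / entities: lists of dicts with hypothetical keys
    "id", "start", "end"; gold_relations: iterable of
    (trigger_id, entity_id, relation_type) triples.
    """
    gold = {(t_id, e_id): rel for t_id, e_id, rel in gold_relations}
    instances = []
    for trigger, entity in product(triggers, entities):
        instances.append({
            "text": snippet_text,  # multi-sentence context window, per the abstract
            "trigger_span": (trigger["start"], trigger["end"]),
            "entity_span": (entity["start"], entity["end"]),
            "label": gold.get((trigger["id"], entity["id"]), NO_RELATION),
        })
    return instances
```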
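Similarly, a minimal sketch of the 10-fold soft-voting step described above: per-class probabilities from the fold models are averaged and the argmax is taken as the ensembled label. The array shapes and the use of NumPy are assumptions, not details from the paper.

```python
import numpy as np

def soft_vote(fold_probs):
    """fold_probs: list of (num_examples, num_classes) probability arrays,
    one per cross-validation fold model. Returns the ensembled class ids."""
    mean_probs = np.mean(np.stack(fold_probs, axis=0), axis=0)  # average over folds
    return mean_probs.argmax(axis=-1)
```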
