USTC-NELSLIP at SemEval-2023 Task 2: Statistical Construction and Dual Adaptation of Gazetteer for Multilingual Complex NER

This paper describes the system developed by the USTC-NELSLIP team for SemEval-2023 Task 2, Multilingual Complex Named Entity Recognition (MultiCoNER II). We propose a method named Statistical Construction and Dual Adaptation of Gazetteer (SCDAG) for multilingual complex NER. The method first constructs a gazetteer using a statistics-based approach. Second, the representations of the gazetteer network and the language model are adapted to each other by minimizing the KL divergence between them at both the sentence level and the entity level. Finally, the two networks are integrated for supervised named entity recognition (NER) training. The proposed method is applied to several state-of-the-art Transformer-based NER models with a gazetteer built from Wikidata, and generalizes well across them. The final predictions are derived from an ensemble of these trained models. Experimental results and detailed analysis verify the effectiveness of the proposed method. The official results show that our system ranked 1st on one track (Hindi) in this task.
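The dual-adaptation step can be illustrated with a minimal NumPy sketch. The exact loss used by SCDAG is not given in the abstract, so everything below is an assumption for illustration: the representation dimensions, the mean-pooling for the sentence level, the per-token treatment for the entity level, and the use of softmax-normalized representations as the distributions whose KL divergence is minimized are all hypothetical choices, not the authors' published formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) over the last axis; eps guards against log(0).
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

# Hypothetical per-token representations for one sentence of 6 tokens,
# projected to an 8-dimensional space (placeholders for real model outputs).
rng = np.random.default_rng(0)
lm_repr = rng.normal(size=(6, 8))    # language-model side
gaz_repr = rng.normal(size=(6, 8))   # gazetteer-network side

# Entity level: compare distributions token by token, then average.
entity_loss = kl_divergence(softmax(lm_repr), softmax(gaz_repr)).mean()

# Sentence level: mean-pool over tokens first, then compare once.
sentence_loss = kl_divergence(softmax(lm_repr.mean(axis=0)),
                              softmax(gaz_repr.mean(axis=0)))

# A combined adaptation objective would minimize both terms.
adaptation_loss = entity_loss + sentence_loss
```

In a real training loop these arrays would be differentiable tensors and the gazetteer network's parameters would be updated to drive both KL terms toward zero, pulling its representations toward the language model's before the joint NER training stage.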
