Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries

Named entity recognition (NER) for Web queries is challenging. Queries often are not well-formed sentences and carry very little context, and the entities they mention are highly ambiguous. Code-mixed queries, in which entities appear in a different language than the rest of the query, pose a particular challenge in domains like e-commerce (e.g., queries containing movie or product names). This work tackles NER for code-mixed queries, where entities and non-entity query terms co-exist in different languages. Our contributions are twofold. First, to address the lack of code-mixed NER data, we create EMBER, a large-scale dataset in six languages with four different scripts. Based on Bing query data, we include numerous language combinations that showcase real-world search scenarios. Second, we propose a novel gated architecture that enhances existing multi-lingual Transformers with a Mixture-of-Experts model to dynamically infuse multi-lingual gazetteers, allowing it to simultaneously differentiate and handle entities and non-entity query terms in multiple languages. Experimental evaluation on code-mixed queries in several languages shows that our approach efficiently utilizes gazetteers to recognize entities in code-mixed queries with an F1 of 68%, an absolute improvement of +31% over a non-gazetteer baseline.
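To make the gating idea concrete, here is a minimal sketch of how a Mixture-of-Experts style gate might blend a token's contextual representation with a gazetteer-derived representation. It is not the paper's implementation: the function names (`softmax`, `gate_token`), the two-expert setup, the vector sizes, and the all-zero gate parameters are assumptions chosen only for illustration. In a trained model the gate parameters would be learned and the vectors would come from a multi-lingual Transformer and a gazetteer encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_token(h, g, gate_rows):
    """Blend a contextual vector h with a gazetteer vector g for one token.

    h, g      : equal-length lists of floats (hypothetical token features).
    gate_rows : two weight rows, one per expert, each of length len(h) + len(g).
                These stand in for learned gate parameters (an assumption).
    Returns the mixed vector and the (contextual, gazetteer) gate weights.
    """
    if len(h) != len(g):
        raise ValueError("h and g must have the same dimension")
    concat = list(h) + list(g)
    # One score per expert, computed from both representations.
    scores = [sum(w * x for w, x in zip(row, concat)) for row in gate_rows]
    w_ctx, w_gaz = softmax(scores)
    # Convex combination of the two experts' outputs.
    mixed = [w_ctx * a + w_gaz * b for a, b in zip(h, g)]
    return mixed, (w_ctx, w_gaz)


# Toy values: h mimics a contextual embedding, g a gazetteer match feature.
h = [0.2, -0.4, 0.9]
g = [1.0, 0.0, 0.0]
# Untrained (all-zero) gate, so both experts receive equal weight.
gate_rows = [[0.0] * 6, [0.0] * 6]
mixed, weights = gate_token(h, g, gate_rows)
```

Because the gate weights come from a softmax, they always sum to one, so the output is a per-token interpolation between the two sources. A learned gate could then lean on the gazetteer for tokens that look like entity names and on the contextual encoder elsewhere.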

Shervin Malmasi | Besnik Fetahu | Oleg Rokhlenko | Anjie Fang
