A Large-Scale Chinese Multimodal NER Dataset with Speech Clues

In this paper, we explore a previously uncharted territory: Chinese multimodal named entity recognition (NER) with both textual and acoustic content. To this end, we construct a large-scale human-annotated Chinese multimodal NER dataset, named CNERTA. Our corpus contains a total of 42,987 annotated sentences accompanied by 71 hours of speech data. Based on this dataset, we propose a family of strong and representative baseline models that leverage either textual features or multimodal features. Building on these baselines, to capture the natural monotonic alignment between the textual modality and the acoustic modality, we further propose a simple multimodal multitask model that introduces a speech-to-text alignment auxiliary task. Through extensive experiments, we observe that: (1) performance improves progressively as we move from unimodal to multimodal models, verifying the necessity of integrating speech clues into Chinese NER; and (2) our proposed model yields state-of-the-art (SoTA) results on CNERTA, demonstrating its effectiveness. For further research, the annotated dataset is publicly available at http://github.com/DianboWork/CNERTA.
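
As a rough illustration of the multitask setup described above, an auxiliary task is often trained jointly with the main task by minimizing a weighted sum of the two losses. The form below is a generic sketch under that assumption, not necessarily the objective used for CNERTA; the symbols $\mathcal{L}_{\mathrm{NER}}$, $\mathcal{L}_{\mathrm{align}}$, and the weight $\lambda$ are names introduced here for illustration only.

```latex
% Generic multitask objective (assumed form, not taken from the paper):
%   L_NER   : loss of the main NER tagging task
%   L_align : loss of the speech-to-text alignment auxiliary task
%   lambda  : hypothetical non-negative weight balancing the two tasks
\begin{equation}
  \mathcal{L}(\theta)
    = \mathcal{L}_{\mathrm{NER}}(\theta)
    + \lambda \, \mathcal{L}_{\mathrm{align}}(\theta),
  \qquad \lambda \ge 0 .
\end{equation}
```

Under this sketch, setting $\lambda = 0$ recovers a model trained on NER alone, while larger values put more emphasis on the alignment task.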
