End-to-end Named Entity Recognition from English Speech

Named entity recognition (NER) from text has been a widely studied problem and usually extracts semantic information from text. Until now, NER from speech is mostly studied in a two-step pipeline process that includes first applying an automatic speech recognition (ASR) system on an audio sample and then passing the predicted transcript to a NER tagger. In such cases, the error does not propagate from one step to another as both the tasks are not optimized in an end-to-end (E2E) fashion. Recent studies confirm that integrated approaches (e.g., E2E ASR) outperform sequential ones (e.g., phoneme based ASR). In this paper, we introduce a first publicly available NER annotated dataset for English speech and present an E2E approach, which jointly optimizes the ASR and NER tagger components. Experimental results show that the proposed E2E approach outperforms the classical two-step approach. We also discuss how NER from speech can be used to handle out of vocabulary (OOV) words in an ASR system.

[1]  M. Verleysen,et al.  Classification in the Presence of Label Noise: A Survey , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[2]  Yossi Matias,et al.  Audio De-identification - a New Entity Recognition Task , 2019, NAACL.

[3]  Ellen M. Voorhees,et al.  The TREC Spoken Document Retrieval Track: A Success Story , 2000, TREC.

[4]  Roland Vollgraf,et al.  Pooled Contextualized Embeddings for Named Entity Recognition , 2019, NAACL.

[5]  Eric Nichols,et al.  Named Entity Recognition with Bidirectional LSTM-CNNs , 2015, TACL.

[6]  Guillaume Lample,et al.  Neural Architectures for Named Entity Recognition , 2016, NAACL.

[7]  Roland Vollgraf,et al.  Contextual String Embeddings for Sequence Labeling , 2018, COLING.

[8]  Jan Hajic,et al.  Neural Architectures for Nested NER through Linearization , 2019, ACL.

[9]  Kenneth Heafield,et al.  KenLM: Faster and Smaller Language Model Queries , 2011, WMT@EMNLP.

[10]  Daniel Jurafsky,et al.  First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs , 2014, ArXiv.

[11]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[12]  Gökhan Tür,et al.  Beyond ASR 1-best: Using word confusion networks in spoken language understanding , 2006, Comput. Speech Lang..

[13]  Luke S. Zettlemoyer,et al.  Cloze-driven Pretraining of Self-attention Networks , 2019, EMNLP.

[14]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[15]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[16]  Paul Deléglise,et al.  TED-LIUM: an Automatic Speech Recognition dedicated corpus , 2012, LREC.

[17]  James Allan,et al.  Text classification and named entities for new event detection , 2004, SIGIR '04.

[18]  Olivier Galibert,et al.  How to evaluate ASR output for named entity recognition? , 2015, INTERSPEECH.

[19]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Hinrich Schütze,et al.  Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition , 2011, ACL.

[21]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[22]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[23]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[24]  Yannick Estève,et al.  End-to-end named entity extraction from speech , 2018, ArXiv.

[25]  Mark Dredze,et al.  OOV Sensitive Named-Entity Recognition in Speech , 2011, INTERSPEECH.

[26]  Jingbo Zhu,et al.  Improved Differentiable Architecture Search for Language Modeling and Named Entity Recognition , 2019, EMNLP.