Encoding Syntactic Knowledge in Transformer Encoder for Intent Detection and Slot Filling

We propose a novel Transformer encoder-based architecture that encodes syntactic knowledge for joint intent detection and slot filling. Specifically, we inject syntactic knowledge into the Transformer encoder by jointly training it, via multi-task learning, to predict the syntactic parse ancestors and the part-of-speech tag of each token. The model consists of self-attention and feed-forward layers and requires no external syntactic information at inference time. Experiments on two benchmark datasets show that our models achieve state-of-the-art results with only two Transformer encoder layers. Compared to the previous best-performing model without pre-training, our models achieve absolute improvements of 1.59% F1 score for slot filling and 0.85% accuracy for intent detection on the SNIPS dataset. On the ATIS dataset, our models achieve absolute improvements of 0.1% F1 score for slot filling and 0.34% accuracy for intent detection over the previous best-performing model. Furthermore, visualizing the self-attention weights illustrates the benefits of incorporating syntactic information during training.
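As a rough illustration of the training setup the abstract describes, the PyTorch sketch below pairs the usual intent and slot heads with two auxiliary heads for POS tagging and parse-ancestor prediction. This is a minimal sketch, not the paper's exact configuration: the head names, the first-token intent pooling, the formulation of ancestor prediction as classification over sequence positions, and the loss weight `aux_weight` are all our assumptions.

```python
import torch
import torch.nn as nn

class SyntaxAwareEncoder(nn.Module):
    """Two-layer Transformer encoder with auxiliary syntactic heads.

    During training, the POS-tagging and parse-ancestor heads inject
    syntactic signal via multi-task learning; at inference only the intent
    and slot heads are used, so no external parser is required.
    """

    def __init__(self, vocab_size, d_model, n_intents, n_slots, n_pos, max_len):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.slot_head = nn.Linear(d_model, n_slots)      # per-token slot labels
        self.intent_head = nn.Linear(d_model, n_intents)  # utterance-level intent
        self.pos_head = nn.Linear(d_model, n_pos)         # auxiliary: POS tags
        self.ancestor_head = nn.Linear(d_model, max_len)  # auxiliary: ancestor position

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.embed(token_ids) + self.pos_embed(positions))
        return {
            "slots": self.slot_head(h),
            "intent": self.intent_head(h[:, 0]),  # first-token pooling (an assumption)
            "pos": self.pos_head(h),
            "ancestors": self.ancestor_head(h),
        }

def joint_loss(outputs, targets, aux_weight=0.5):
    """Multi-task loss; the syntactic terms are used only during training."""
    ce = nn.CrossEntropyLoss()
    loss = ce(outputs["slots"].flatten(0, 1), targets["slots"].flatten())
    loss = loss + ce(outputs["intent"], targets["intent"])
    loss = loss + aux_weight * ce(outputs["pos"].flatten(0, 1), targets["pos"].flatten())
    loss = loss + aux_weight * ce(outputs["ancestors"].flatten(0, 1), targets["ancestors"].flatten())
    return loss
```

At inference, only the `slots` and `intent` outputs are consumed, which is what allows the model to run without external syntactic information; the auxiliary heads exist solely to shape the shared encoder representations during training.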
