A comparative study of recurrent neural network models for lexical domain classification

Domain classification is a critical pre-processing step for many speech understanding and dialog systems, as it allows certain types of utterances to be routed to specialized subsystems. In previous work, we explored various neural network (NN) architectures for binary utterance classification based on lexical features and found that they improved upon more traditional statistical baselines. In this paper we generalize to an n-way classification task and test the best-performing NN architectures on a large, real-world dataset from the Cortana personal assistant application. As in the earlier work, we find that recurrent NNs with gated memory units (LSTM and GRU) perform best, outperforming state-of-the-art baseline systems based on language models or boosting classifiers. NN classifiers can still benefit from combining their posterior class estimates with traditional language model likelihood ratios via a logistic regression combiner. We also investigate whether it is better to use an ensemble of binary classifiers or a single NN trained for n-way classification, and how each approach performs in combination with the baseline classifiers. The best overall results are obtained by first combining an ensemble of binary GRU-NN classifiers with LM likelihood ratios via logistic regression, and then picking the class with the highest combined posterior estimate.
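
As a concrete illustration of the best-performing configuration described above, the sketch below shows, in rough outline and not as the authors' actual implementation, a per-domain binary GRU classifier and a logistic-regression combiner that fuses its posteriors with language model likelihood ratios, followed by an argmax over the combined per-domain scores. The use of PyTorch and scikit-learn, all dimensions, and all class and variable names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of the pipeline:
# binary GRU classifiers + LM likelihood ratios -> logistic regression fusion -> argmax.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression


class BinaryGRUClassifier(nn.Module):
    """One-vs-rest lexical classifier: word IDs -> P(domain | utterance)."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, seq_len) integer word indices
        _, last_hidden = self.gru(self.embed(word_ids))               # (1, batch, hidden)
        return torch.sigmoid(self.out(last_hidden[-1])).squeeze(-1)   # (batch,)


def combine_and_decide(nn_posteriors: np.ndarray,
                       lm_llrs: np.ndarray,
                       labels: np.ndarray) -> np.ndarray:
    """Fuse per-domain NN posteriors with per-domain LM log-likelihood ratios.

    nn_posteriors, lm_llrs: (num_utterances, num_domains) score matrices.
    labels: (num_utterances,) gold domain indices used to fit the combiners
            (in practice the combiners would be fit on a held-out dev set).
    Returns the predicted domain index per utterance.
    """
    num_utts, num_domains = nn_posteriors.shape
    combined = np.zeros_like(nn_posteriors)
    for d in range(num_domains):
        # One logistic-regression combiner per binary (domain-vs-rest) task.
        features = np.column_stack([nn_posteriors[:, d], lm_llrs[:, d]])
        targets = (labels == d).astype(int)
        combiner = LogisticRegression().fit(features, targets)
        combined[:, d] = combiner.predict_proba(features)[:, 1]
    # n-way decision: pick the domain with the highest combined posterior.
    return combined.argmax(axis=1)
```

The decision rule at the end mirrors the procedure summarized in the abstract: fuse each binary classifier's score with the corresponding LM likelihood ratio, then select the domain with the highest combined posterior.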
