Contextual keyword spotting in lecture video with deep convolutional neural network

We present a keyword spotting (KWS) system for lecture video that uses a deep convolutional neural network (CNN) architecture. The CNN architecture lets us train on lecture-video speech recorded under diverse conditions without specialized feature engineering. We also apply a language model and word stemming to the resulting transcript to automatically correct misspellings and recognition errors. Our model yields an average accuracy of 69.01%, a 34.53% improvement over the baseline model. We also show that augmenting the training dataset improves the model's robustness on the test set.
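As a rough sketch of the transcript post-processing step described above (the function names, the crude suffix stemmer, and the use of fuzzy string matching in place of a full language model are all our own assumptions, not the paper's implementation), spell correction and stemming might be combined for keyword matching like this:

```python
import difflib

def stem(word):
    # Crude suffix-stripping stemmer, illustrative only; the paper's
    # actual stemming method is not specified in the abstract.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def correct(word, vocabulary, cutoff=0.8):
    # Map a possibly misspelled transcript token to the closest
    # vocabulary entry; a stand-in for language-model-based correction.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def spot_keywords(transcript_tokens, keywords, vocabulary):
    # Match stemmed, spell-corrected tokens against stemmed keywords,
    # returning (position, original token) pairs for each hit.
    keyword_stems = {stem(k.lower()) for k in keywords}
    hits = []
    for i, token in enumerate(transcript_tokens):
        if stem(correct(token.lower(), vocabulary)) in keyword_stems:
            hits.append((i, token))
    return hits
```

For example, a misrecognized token such as "gradint" would be corrected to "gradient" and then matched against the stemmed keyword "gradients".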
