Speech Evaluation Based on Deep Learning Audio Caption

Speech evaluation is an essential part of language learning. Traditionally, experts evaluate the voice and pronunciation of test takers, a process that lacks efficiency and consistent standards. In this paper, we propose a novel approach, based on deep learning and audio captioning, that evaluates speech in place of linguistic experts. First, the proposed approach extracts audio features from the speech. Next, deep learning is used to learn the relationship between the audio features and the expert evaluations. Finally, an LSTM model is applied to predict the expert evaluations. Experiments are conducted on a real-world dataset collected by our collaborating company. The results show that the proposed approach achieves excellent performance and has high potential for practical application.
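
As a rough illustration of the pipeline outlined in the abstract (feature extraction followed by an LSTM that maps the feature sequence to an evaluation), the sketch below uses only the Python standard library. It is not the authors' implementation: the features (per-frame log energy and zero-crossing rate), the frame sizes, the class name TinyLSTMScorer, the single sigmoid output, and the random untrained weights are all assumptions chosen for illustration, and the training step that would fit the weights to expert evaluations is omitted.

```python
import math
import random


def frame_features(samples, frame_len=400, hop=160):
    """Return per-frame [log energy, zero-crossing rate].

    These two features are stand-ins (an assumption); the abstract does not
    say which audio features the proposed approach extracts.
    """
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append([math.log(energy + 1e-10), crossings / (frame_len - 1)])
    return feats


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyLSTMScorer:
    """Hypothetical single-layer LSTM with a scalar output in (0, 1).

    Weights are random and untrained; in practice they would be learned
    from pairs of recordings and expert evaluations.
    """

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [x_t, h_{t-1}].
        self.W = {g: mat(n_hidden, n_in + n_hidden) for g in "ifog"}
        self.b = {g: [0.0] * n_hidden for g in "ifog"}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]

    def _gate(self, g, xh):
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(self.W[g], self.b[g])]

    def score(self, feats):
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        for x in feats:
            xh = list(x) + h
            i = [sigmoid(z) for z in self._gate("i", xh)]
            f = [sigmoid(z) for z in self._gate("f", xh)]
            o = [sigmoid(z) for z in self._gate("o", xh)]
            g = [math.tanh(z) for z in self._gate("g", xh)]
            # Standard LSTM update: new cell state, then new hidden state.
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        # Map the final hidden state to a single score.
        return sigmoid(sum(w * hk for w, hk in zip(self.w_out, h)))


# Synthetic one-second 220 Hz tone at 16 kHz stands in for a recording.
samples = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
feats = frame_features(samples)
model = TinyLSTMScorer(n_in=2, n_hidden=4)
predicted = model.score(feats)
```

Here `predicted` is a placeholder score between 0 and 1; it becomes meaningful only after the weights are trained against expert evaluations, which this sketch does not attempt.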
