Generative Adversarial Network-Based Neural Audio Caption Model for Oral Evaluation

Oral evaluation is one of the most critical processes in children’s language learning. Traditionally, the Scoring Rubric is widely used in oral evaluation to produce a ranking score by assessing a test taker’s word accuracy, phoneme accuracy, fluency, and accent position. In recent years, driven by emerging market demands, oral evaluation has been expected to provide not only a single pronunciation score but also in-depth, meaningful comments based on content, context, logic, and understanding. However, producing such comments with the Scoring Rubric requires extensive work by human oral evaluation experts, which is considered uneconomical and inefficient in the current market. Therefore, this paper proposes an automated approach to generating expert comments for oral evaluation. The approach first extracts oral features from the children’s audio and text features from the corresponding expert comments. Then, a Gated Recurrent Unit (GRU) is applied to encode the oral features. Afterwards, a Long Short-Term Memory (LSTM) model is trained to learn the mapping between oral features and text features and to generate expert comments for new oral audio. Finally, a Generative Adversarial Network (GAN) is incorporated to improve the quality of the generated comments: it generates pseudo-comments that are used to train the discriminator to recognize human-like comments. The proposed approach is evaluated on a real-world dataset of children’s oral audio collected by our collaborating company. It is also integrated into a commercial application that generates expert comments for children’s oral evaluation. The experimental results and the lessons learned from real-world deployment show that the proposed approach is effective at providing meaningful comments for oral evaluation.
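
The abstract does not state the adversarial training objective. As a point of reference only, the standard conditional GAN objective can be written for this setting as below, under assumed notation that is not taken from the paper: a denotes the encoded oral features, c a human expert comment, G the comment generator, and D the discriminator. Conditioning the discriminator on the audio features is likewise an assumption.

```latex
% Standard conditional GAN minimax objective (assumed notation, not the paper's):
%   a = encoded oral features, c = human expert comment,
%   G(a) = generated pseudo-comment, D(. | a) = probability the comment is human-written
\min_{G}\,\max_{D}\;
  \mathbb{E}_{(a,c)\sim p_{\mathrm{data}}}\bigl[\log D(c \mid a)\bigr]
  + \mathbb{E}_{a\sim p_{\mathrm{data}}}\bigl[\log\bigl(1 - D(G(a) \mid a)\bigr)\bigr]
```

Under this formulation the discriminator is rewarded for scoring expert comments high and pseudo-comments low, while the generator is rewarded when its pseudo-comments are scored as human-like. The loss actually used in the paper may differ.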
