Generating Sound Words from Audio Signals of Acoustic Events with Sequence-to-Sequence Model

Representing sounds in language as sound words, or onomatopoeia, is not only useful as an auxiliary means for automatic speech recognition, but also essential in emerging applications such as natural human-machine communication, searching audio archives for acoustic events, and sound-based abnormality detection. This paper proposes a novel method for generating sound words from audio signals. The method is based on an end-to-end, sequence-to-sequence framework and addresses two problems: the segmentation problem of finding the segments of the audio signal along the time axis that correspond to a sequence of phonemes, and the ambiguity problem, in which multiple words may correspond to the same sound depending on the situation or the listener. In our experiments, the method worked efficiently, achieving a 2.8% mean phoneme error rate (MPER) and a 7.2% word error rate (WER) on a sound word generation task.
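
A minimal sketch of the kind of attention-based encoder-decoder the abstract describes, mapping acoustic feature frames to a phoneme sequence. This assumes PyTorch; all module names, layer sizes, the feature dimension, and the phoneme vocabulary size are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch only: hyperparameters and architecture details are
# assumptions, not the configuration reported in the paper.
import torch
import torch.nn as nn

class SoundWordSeq2Seq(nn.Module):
    def __init__(self, n_mels=40, hidden=256, n_phonemes=45):
        super().__init__()
        # Encoder: bidirectional LSTM over acoustic frames (e.g., log-mel features)
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        # Decoder: unidirectional LSTM over previously emitted phoneme embeddings
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 2 * hidden)        # bilinear-style attention scores
        self.out = nn.Linear(hidden + 2 * hidden, n_phonemes)

    def forward(self, feats, phoneme_in):
        # feats: (B, T, n_mels) acoustic frames; phoneme_in: (B, U) shifted targets
        enc, _ = self.encoder(feats)                     # (B, T, 2*hidden)
        dec, _ = self.decoder(self.embed(phoneme_in))    # (B, U, hidden)
        # Attention: each decoder state queries the encoder states, so the model
        # learns a soft alignment between phonemes and spans of audio frames
        scores = torch.bmm(self.attn(dec), enc.transpose(1, 2))   # (B, U, T)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc)   # (B, U, 2*hidden)
        return self.out(torch.cat([dec, context], dim=-1))        # (B, U, n_phonemes)

# Training would minimize cross-entropy against reference phoneme sequences;
# at inference, phonemes are decoded autoregressively (greedy or beam search).
model = SoundWordSeq2Seq()
feats = torch.randn(2, 100, 40)        # two clips, 100 frames each
phons = torch.randint(0, 45, (2, 8))   # shifted phoneme targets
logits = model(feats, phons)           # (2, 8, 45)
```

The soft attention alignment is what lets this family of models handle the segmentation problem without explicit time boundaries: each emitted phoneme attends to the audio frames it best explains.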
