Using Sequential Information in Polyphonic Sound Event Detection

Detecting the class, onset time, and offset time of sound events in real-world recordings is a challenging task. Current systems often achieve relatively high frame-wise accuracy but low event-wise accuracy. In this paper, we attempt to bridge this gap by explicitly incorporating sequential information into a state-of-the-art polyphonic sound event detection system. We propose to 1) use delayed predictions of event activities as additional input features that are fed back to the neural network; 2) build N-grams to model the co-occurrence probabilities of different events; and 3) use a sequential loss to train the neural network. Our experiments on a corpus of real-world recordings show that the N-grams smooth the spiky output of a state-of-the-art neural network system and improve both frame-wise and event-wise metrics.
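The N-gram smoothing idea can be illustrated with a small sketch. The code below is a minimal, hypothetical example rather than the paper's actual method: it smooths the frame-wise activity posteriors of a single event class with a binary-state bigram (first-order Markov) model via Viterbi decoding. The paper's N-grams additionally model co-occurrences across different events, which this per-class sketch does not capture; the `p_stay` transition probability and the input shapes are assumptions.

```python
import numpy as np

def viterbi_smooth(posteriors, p_stay=0.99):
    """Smooth frame-wise activity posteriors for one event class.

    posteriors : 1-D array of length T, P(event active | frame t),
                 e.g. the sigmoid outputs of a neural network.
    p_stay     : assumed probability of staying in the same state
                 (off->off or on->on) between consecutive frames.
    Returns the most likely binary state sequence (0 = off, 1 = on).
    """
    T = len(posteriors)
    # Bigram (transition) log-probabilities, indexed [from_state][to_state].
    log_trans = np.log([[p_stay, 1.0 - p_stay],
                        [1.0 - p_stay, p_stay]])
    # Frame-wise emission log-probabilities for states 0 and 1.
    log_emit = np.log(np.stack([1.0 - posteriors, posteriors], axis=1) + 1e-12)

    delta = np.zeros((T, 2))            # best log-score ending in each state
    psi = np.zeros((T, 2), dtype=int)   # back-pointers
    delta[0] = log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1, :, None] + log_trans   # (from, to) scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]

    # Back-trace the most likely state sequence.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

With a high `p_stay`, isolated one-frame spikes in the posteriors are suppressed and short gaps inside an active event are filled, which is the qualitative effect the abstract attributes to N-gram smoothing on event-wise metrics.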
