Sound Event Detection: A Tutorial

Imagine standing on a street corner in the city. With your eyes closed, you can hear and recognize a succession of sounds: cars passing by, people speaking, their footsteps as they walk by, and the continuous falling of rain. Recognizing all these sounds and interpreting the perceived scene as a city street soundscape comes naturally to humans. It is, however, the result of years of "training": encountering and learning associations among the vast variety of sounds in everyday life, the sources producing these sounds, and the names given to them.
