Audio-Based Event Detection at Different SNR Settings Using Two-Dimensional Spectrogram Magnitude Representations

Audio-based event detection poses a number of challenges that are not encountered in other fields, such as image detection. The effects of ambient noise, low Signal-to-Noise Ratio (SNR) and microphone distance are not yet fully understood. If multimodal approaches are to improve across a range of fields of interest, audio analysis will have to play an integral part. Event recognition in autonomous vehicles (AVs) is one such field, still at a nascent stage, in which recognition can rely solely on audio or use audio as part of a multimodal approach. This manuscript presents an extensive analysis comparing different magnitude representations of the raw audio. The data on which the analysis is carried out are part of the publicly available MIVIA Audio Events dataset. Single-channel Short-Time Fourier Transform (STFT), mel-scale and Mel-Frequency Cepstral Coefficient (MFCC) spectrogram representations are used. Furthermore, two methods of aggregating these spectrogram representations are examined: concatenating the features, and stacking them as separate channels. The effect of the SNR on recognition accuracy and the generalization of the proposed methods to datasets both seen and not seen during training are studied and reported.
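The difference between the two aggregation methods can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names, the frame length, the hop size, the bin counts and the interpolation step are all assumptions made for illustration, and the mel-scale and MFCC inputs are replaced by crude stand-ins derived from the STFT magnitude so that the example stays self-contained.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed framed FFT, shape (freq, time)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def resize_bins(spec, n_bins):
    """Hypothetical helper: linearly interpolate the frequency axis to n_bins rows."""
    old = np.linspace(0.0, 1.0, spec.shape[0])
    new = np.linspace(0.0, 1.0, n_bins)
    # Each row of spec.T is one time frame; interpolate it along frequency.
    return np.stack([np.interp(new, old, frame) for frame in spec.T], axis=1)

def concatenate_features(reps):
    """Concatenation: join the representations along the frequency axis (one channel)."""
    return np.concatenate(reps, axis=0)

def stack_as_channels(reps, n_bins):
    """Stacking: bring every representation to a common height, then add a channel axis."""
    return np.stack([resize_bins(r, n_bins) for r in reps], axis=-1)

# One second of a noisy 440 Hz tone at 8 kHz (synthetic input, not MIVIA data).
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
audio = np.sin(2 * np.pi * 440.0 * t) + 0.1 * rng.standard_normal(t.size)

stft = stft_magnitude(audio)                 # 129 frequency bins
mel_like = resize_bins(stft, 64)             # stand-in only, NOT a real mel filterbank
mfcc_like = resize_bins(np.log1p(stft), 20)  # stand-in only, NOT real MFCCs

concatenated = concatenate_features([stft, mel_like, mfcc_like])
stacked = stack_as_channels([stft, mel_like, mfcc_like], n_bins=64)

print(concatenated.shape)  # (frequency bins summed over representations, time)
print(stacked.shape)       # (common frequency bins, time, channels)
```

Concatenation yields a single tall one-channel input, whereas stacking yields a multi-channel input similar to an RGB image. Stacking requires all representations to share the same height and width, which is why the sketch interpolates them to a common number of bins first; how that alignment is actually performed is a design choice the sketch does not settle.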
