Detecting Sounds of Interest in Roads with Deep Networks

Monitoring public and private places is important for people's security and is usually done with surveillance cameras. In this paper we propose an approach to road monitoring that detects car crashes and tire skidding by analyzing sound signals, and that can complement or, in some cases, replace video analytics systems. The proposed system employs a MobileNet deep architecture, which is designed to run efficiently on embedded appliances and can be deployed in distributed systems for road monitoring. We designed a recognition system based on the analysis of audio frames and tested it on the publicly available MIVIA road events data set. The recognition rate that we achieved (higher than \(99\%\)) exceeds that of existing methods, demonstrating that the proposed approach can be deployed on embedded devices in a distributed surveillance system.
