A Study of Features and Deep Neural Network Architectures and Hyper-Parameters for Domestic Audio Classification

Recent methodologies for audio classification frequently involve cepstral and spectral features, applied to single-channel recordings of acoustic scenes and events. Furthermore, transfer learning has been widely adopted over the years and has proven to be an efficient alternative to training neural networks from scratch. The lower time and resource requirements of pre-trained models allow for more versatility in developing classification approaches. However, information on classification performance when using different features for multi-channel recordings is often limited. Moreover, pre-trained networks are initially trained on large databases and are often unnecessarily large. This poses a challenge when developing systems for devices with limited computational resources, such as mobile or embedded devices. This paper presents a detailed study of the most prominent and widely used cepstral and spectral features for multi-channel audio applications. Accordingly, we propose the use of spectro-temporal features. Additionally, the paper details the development of a compact version of the AlexNet model for computationally limited platforms, through studies of performance under various architectural and parameter modifications of the original network. The aim is to minimize the network size while maintaining the series network architecture and preserving the classification accuracy. Considering that other state-of-the-art compact networks present complex directed acyclic graphs, a series architecture offers an advantage in customizability. Experimentation was carried out in MATLAB, using a database that we generated for this task, which comprises four-channel synthetic recordings of both sound events and scenes. The top-performing methodology resulted in a weighted F1-score of 87.92% for scalogram features classified via the modified AlexNet-33 network, which has a size of 14.33 MB; the original AlexNet network returned 86.24% at a size of 222.71 MB.

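To make the proposed feature pipeline concrete, the following is a minimal sketch of per-channel scalogram (continuous wavelet transform) extraction for a four-channel recording, together with the weighted F1-score used to report results. The authors' experiments were carried out in MATLAB; this sketch instead uses Python with PyWavelets, SoundFile, and scikit-learn, and the file name, number of scales, wavelet choice, and label arrays are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: scalogram features for a four-channel recording,
# plus the weighted F1-score metric. All parameter values are assumptions.
import numpy as np
import pywt                       # continuous wavelet transform
import soundfile as sf            # multi-channel audio I/O
from sklearn.metrics import f1_score

def channel_scalogram(x, fs, num_scales=64, wavelet="morl"):
    """Return a (num_scales, num_samples) magnitude scalogram for one channel."""
    scales = np.arange(1, num_scales + 1)
    coeffs, _ = pywt.cwt(x, scales, wavelet, sampling_period=1.0 / fs)
    return np.abs(coeffs)

# Load a four-channel recording: `audio` has shape (num_samples, 4).
audio, fs = sf.read("example_4ch_recording.wav")   # hypothetical file name
scalograms = np.stack(
    [channel_scalogram(audio[:, ch], fs) for ch in range(audio.shape[1])],
    axis=0,
)  # shape (4, num_scales, num_samples): one image-like plane per channel

# Weighted F1-score: per-class F1 averaged with class-support weights,
# as used to report the classification results.
y_true = [0, 0, 1, 2, 2, 2]       # placeholder ground-truth labels
y_pred = [0, 1, 1, 2, 2, 0]       # placeholder predictions
print(f1_score(y_true, y_pred, average="weighted"))
```

The stacked per-channel scalograms can then be treated as multi-channel image-like inputs to a convolutional network such as the compact AlexNet variant described above.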