Partially-Shared Convolutional Neural Network for Classification of Multi-Channel Recorded Audio Signals
Technical Report, Detection and Classification of Acoustic Scenes and Events 2018 Challenge

This technical report presents the system used in our submission for task 5 of the DCASE 2018 challenge [1]. We proposed a partially-shared convolutional neural network, a multi-task system with a common input (multi-channel log Mel features) and two output branches: a classification branch, which outputs the predicted class, and a regression branch, which outputs a single-channel representation of the multi-channel input data. Because the two branches share part of the network, training for regression is expected to benefit training for classification, and vice versa. Since task 5 targets classification from multi-channel audio input, we trained classification and regression jointly to improve classification performance. Combining the proposed system with parameter tuning of the baseline CNN system raised the classification F1 score to 89.94% in four-fold cross-validation, compared with 84.50% for the baseline system.
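The shared-trunk, two-branch idea described above can be illustrated with a minimal sketch. This is not the submitted system: it replaces the convolutional layers with tiny dense layers in plain Python, and every size, name, and weight below is hypothetical. In particular, the regression target (the per-band average over channels) and the loss weighting `reg_weight` are assumptions made for illustration; the abstract does not specify either.

```python
import math
import random

# Toy sizes, chosen only for illustration (assumptions, not the paper's values).
N_CHANNELS, N_MELS, HIDDEN, N_CLASSES = 4, 5, 8, 3

def dense(x, w, b):
    """Affine layer on plain lists: returns w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def init(rows, cols, rng):
    """Small random weight matrix and zero bias."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return w, [0.0] * rows

def forward(x, params):
    """Shared trunk feeds both the classification and regression branches."""
    h = relu(dense(x, *params["shared"]))
    probs = softmax(dense(h, *params["cls"]))   # classification branch
    recon = dense(h, *params["reg"])            # regression branch
    return probs, recon

def joint_loss(probs, label, recon, target, reg_weight=1.0):
    """Cross-entropy plus weighted mean squared error (weighting is an assumption)."""
    ce = -math.log(probs[label])
    mse = sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)
    return ce + reg_weight * mse

rng = random.Random(0)
params = {
    "shared": init(HIDDEN, N_CHANNELS * N_MELS, rng),
    "cls": init(N_CLASSES, HIDDEN, rng),
    "reg": init(N_MELS, HIDDEN, rng),
}

# Fake multi-channel log Mel frame: one list of N_MELS values per channel.
channels = [[rng.uniform(-1.0, 1.0) for _ in range(N_MELS)] for _ in range(N_CHANNELS)]
x = [v for ch in channels for v in ch]  # flattened common input
# Assumed single-channel target: average over channels for each Mel band.
target = [sum(ch[k] for ch in channels) / N_CHANNELS for k in range(N_MELS)]

probs, recon = forward(x, params)
loss = joint_loss(probs, label=1, recon=recon, target=target)
```

Because both loss terms depend on the shared layer, a gradient step on `loss` would update that layer using both the classification and the regression error, which is the mechanism the abstract relies on.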

[1] B. A. D. H. Brandwood. A complex gradient operator and its application in adaptive array theory, 1983.

[2] Hiroshi Sawada, et al. Polar coordinate based nonlinear function for frequency-domain blind source separation, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3] Christopher V. Alvino, et al. Geometric source separation: merging convolutive source separation with geometric beamforming, 2001, Neural Networks for Signal Processing XI: Proceedings of the 2001 IEEE Signal Processing Society Workshop (IEEE Cat. No.01TH8584).

[4] Jean Rouat, et al. Robust Recognition of Simultaneous Speech by a Mobile Robot, 2007, IEEE Transactions on Robotics.

[5] Kazuhiro Nakadai, et al. Correlation matrix estimation by an optimally controlled recursive average method and its application to blind source separation, 2010.

[6] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[7] Hiroshi G. Okuno, et al. Robot audition: Its rise and perspectives, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[9] Kazuhiro Nakadai, et al. Partially Shared Deep Neural Network in sound source separation and identification using a UAV-embedded microphone array, 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[10] Sepp Hochreiter, et al. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), 2015, ICLR.

[11] Hiroshi G. Okuno, et al. Development, Deployment and Applications of Robot Audition Open Source Software HARK, 2017, J. Robotics Mechatronics.