Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. Video frames and their accompanying audio are extracted at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionarily optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for the multi-modality classifier on synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are considered as feature generators. We show that situations where a single modality may be confused by anomalous data points are corrected through the emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone and a densely crowded street misclassified as a forest by the image classifier alone. Both examples are correctly classified by our multi-modality approach.
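
The late fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two primary networks are replaced by synthetic class-probability outputs, the fusion point is assumed to be those output vectors, and the tertiary network is reduced to a single softmax layer trained by gradient descent. All function names, sizes, and hyperparameters here are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def late_fusion_features(image_out, audio_out):
    # Late fusion: join the per-sample outputs of the two primary models.
    return np.concatenate([image_out, audio_out], axis=1)

def train_fusion_layer(x, y, n_classes, lr=0.5, epochs=200, seed=0):
    # Stand-in for the tertiary network: one softmax layer, cross-entropy loss.
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.01, size=(x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        grad = (softmax(x @ w + b) - onehot) / len(x)
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

def predict(x, w, b):
    return softmax(x @ w + b).argmax(axis=1)

# Synthetic stand-ins for the image and audio classifier outputs (assumption:
# each primary model emits one probability per class for every one-second sample).
rng = np.random.default_rng(42)
n_samples, n_classes = 400, 8
labels = rng.integers(0, n_classes, size=n_samples)
signal = 2.0 * np.eye(n_classes)[labels]
image_out = softmax(rng.normal(size=(n_samples, n_classes)) + signal)
audio_out = softmax(rng.normal(size=(n_samples, n_classes)) + signal)

fused = late_fusion_features(image_out, audio_out)
w, b = train_fusion_layer(fused, labels, n_classes)
fused_pred = predict(fused, w, b)
print("fused accuracy on synthetic data:", (fused_pred == labels).mean())
```

Concatenating the outputs lets the fusion layer learn when to trust one modality over the other, which is the kind of correction the abstract describes for the water feature and crowded street examples. A deeper tertiary network, or fusion at an earlier hidden layer, would follow the same pattern.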
