Paper information - Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

The novelty of this study lies in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and the accompanying audio at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionarily optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for the multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are treated as feature generators. We show that cases where a single modality is confused by anomalous data points are corrected through the emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone. Both examples are correctly classified by our multi-modality approach.
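
The late fusion idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the two primary networks are replaced by a hypothetical `simulate_primary_outputs` function that produces noisy class probabilities on synthetic labels, the fusion input is taken to be the concatenated class probabilities of both modalities, and the tertiary neural network is reduced to a single softmax layer trained by gradient descent. None of the function names, hyperparameters, or data come from the paper.

```python
import numpy as np

N_CLASSES = 8  # the abstract describes 8 environments


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def simulate_primary_outputs(labels, signal, rng):
    """Hypothetical stand-in for a trained single-modality network.

    Returns noisy class probabilities; a larger `signal` means the
    simulated network is right more often.
    """
    logits = rng.normal(0.0, 1.0, size=(labels.size, N_CLASSES))
    logits[np.arange(labels.size), labels] += signal
    return softmax(logits)


def fuse(image_probs, audio_probs):
    """Late fusion input: concatenate the two networks' outputs per sample."""
    return np.concatenate([image_probs, audio_probs], axis=1)


def train_fusion_layer(features, labels, epochs=300, lr=0.5):
    """Train one softmax layer on the fused features (simplified fusion model)."""
    n, d = features.shape
    W = np.zeros((d, N_CLASSES))
    b = np.zeros(N_CLASSES)
    onehot = np.eye(N_CLASSES)[labels]
    losses = []
    for _ in range(epochs):
        probs = softmax(features @ W + b)
        # Cross-entropy loss before this epoch's update.
        losses.append(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))
        grad = (probs - onehot) / n
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b, losses


def predict(features, W, b):
    return softmax(features @ W + b).argmax(axis=1)


# Synthetic, synchronised "image" and "audio" predictions (assumed data).
rng = np.random.default_rng(0)
labels = rng.integers(0, N_CLASSES, size=2000)
image_probs = simulate_primary_outputs(labels, signal=2.0, rng=rng)
audio_probs = simulate_primary_outputs(labels, signal=2.5, rng=rng)

fused = fuse(image_probs, audio_probs)
W, b, losses = train_fusion_layer(fused, labels)

image_acc = np.mean(image_probs.argmax(axis=1) == labels)
audio_acc = np.mean(audio_probs.argmax(axis=1) == labels)
fused_acc = np.mean(predict(fused, W, b) == labels)
print(f"image only: {image_acc:.3f}  audio only: {audio_acc:.3f}  fused: {fused_acc:.3f}")
```

The sketch evaluates on the same synthetic samples it trains on, so the printed numbers only illustrate the mechanics of combining two modalities and say nothing about the accuracies reported in the abstract. A faithful reproduction would swap in the outputs of a fine-tuned image network and an audio network, and would use a held-out split for the fusion model.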
