Incremental Audio-Visual Fusion for Person Recognition in Earthquake Scene

Earthquakes cause profound damage to society and property, destroying buildings and infrastructure. Effective earthquake rescue requires rapid and accurate determination of whether survivors are trapped in the rubble of collapsed buildings. While deep learning algorithms can speed up rescue operations using single-modal data (either visual or audio), they face two primary challenges: insufficient information in single-modal data and catastrophic forgetting. In particular, the complexity of earthquake scenes means that single-modal features may not provide adequate information. In addition, catastrophic forgetting occurs when a model loses information learned in a previous task after training on subsequent tasks, a consequence of the non-stationary data distributions of changing earthquake scenes. To address these challenges, we propose an incremental audio-visual fusion model for person recognition in earthquake rescue scenarios. First, we leverage a cross-modal hybrid attention network to capture discriminative temporal context embeddings, combining self-attention and cross-modal attention to fuse multi-modal information and thereby improve the accuracy and reliability of person recognition. Second, we propose an incremental learning model to overcome catastrophic forgetting, consisting of elastic weight consolidation and feature replay modules. Specifically, the elastic weight consolidation module slows down learning on weights in proportion to their importance to previously learned tasks, while the feature replay module reviews learned knowledge by reusing features preserved from the previous task, preventing catastrophic forgetting in dynamic environments. To validate the proposed algorithm, we collected the Audio-Visual Earthquake Person Recognition (AVEPR) dataset from earthquake films and real scenes.
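The cross-modal attention mechanism lets one modality attend over the other, so that, for example, visual frames can query the audio stream for relevant context. The following is a minimal numpy sketch of scaled dot-product cross-modal attention; the feature dimensions, sequence lengths, and fusion step are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, kv_feats):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other. Shapes: [T_q, d] and [T_kv, d]."""
    d = query_feats.shape[-1]
    scores = query_feats @ kv_feats.T / np.sqrt(d)  # [T_q, T_kv]
    weights = softmax(scores, axis=-1)              # rows sum to 1
    return weights @ kv_feats                       # [T_q, d]

# Hypothetical example: 8 visual frames and 20 audio frames, 16-dim features.
rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 16))
audio = rng.standard_normal((20, 16))

# Each modality attends over the other; the two attended streams would then
# be fused downstream (e.g. by concatenation) for recognition.
v_attends_a = cross_modal_attention(visual, audio)
a_attends_v = cross_modal_attention(audio, visual)
```

Self-attention is the special case where queries, keys, and values all come from the same modality, i.e. `cross_modal_attention(visual, visual)`.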
Furthermore, the proposed method achieves 85.41% accuracy while learning the 10th new task, demonstrating its effectiveness and highlighting its potential to significantly improve earthquake rescue efforts.