Attention-driven Multi-sensor Selection

Recent encoder-decoder models for sequence-to-sequence mapping show that integrating both temporal and spatial attention mechanisms into neural networks considerably improves network performance. The use of attention for sensor selection in multi-sensor setups, and the benefit of such an attention mechanism, are less well studied. This work reports on a sensor transformation attention network (STAN) that embeds a sensory attention mechanism to dynamically weight and combine individual input sensors based on their task-relevant information. We demonstrate the correlation of the attentional signal with changing noise levels of each sensor on the audio-visual GRID dataset with synthetic noise, and on CHiME-4, a multi-microphone real-world noisy dataset. In addition, we demonstrate that the STAN model can handle sensor removal and addition without retraining, and is invariant to channel order. Compared to a two-sensor model that weights both sensors equally, the equivalent STAN model has a relative parameter increase of only 0.09%, but reduces the relative character error rate (CER) by up to 19.1% on the CHiME-4 dataset. The attentional signal identifies the lower-SNR sensor with up to 94.2% accuracy.
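The core idea is that each sensor stream is passed through its own transformation module, a small attention module assigns each sensor a scalar score per time step, the scores are normalized across sensors, and the transformed features are merged as a weighted sum. The following is a minimal NumPy sketch of that attention-weighted fusion, assuming a single feedforward transformation layer and one scalar score per sensor and time step; the names (AttentionSensorFusion, W_t, w_a) are illustrative and not taken from the paper, which uses recurrent modules.

```python
# Minimal sketch of attention-weighted sensor fusion (not the paper's
# exact architecture): per-sensor transformation, per-sensor scalar
# attention scores, softmax across sensors, weighted-sum merge.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AttentionSensorFusion:
    def __init__(self, n_sensors, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Per-sensor transformation weights: map raw features to a shared space.
        self.W_t = rng.standard_normal((n_sensors, in_dim, hid_dim)) * 0.1
        # Per-sensor attention weights: map transformed features to a scalar score.
        self.w_a = rng.standard_normal((n_sensors, hid_dim)) * 0.1

    def forward(self, x):
        # x: (n_sensors, time, in_dim), one feature stream per sensor.
        h = np.tanh(np.einsum('std,sdh->sth', x, self.W_t))  # transformed features
        scores = np.einsum('sth,sh->st', h, self.w_a)        # raw attention scores
        alpha = softmax(scores, axis=0)                      # normalize across sensors
        fused = (alpha[..., None] * h).sum(axis=0)           # weighted sum: (time, hid_dim)
        return fused, alpha                                  # alpha is the attentional signal

# Usage: with trained weights, a noisier sensor should receive lower alpha.
fusion = AttentionSensorFusion(n_sensors=2, in_dim=40, hid_dim=64)
x = np.random.randn(2, 100, 40)
fused, alpha = fusion.forward(x)
print(fused.shape, alpha.shape)  # (100, 64) (2, 100)
```

Because the fused output has a fixed dimension and the softmax renormalizes over however many sensors are present, sensors can be added or removed at inference time and their order does not matter, consistent with the properties claimed in the abstract.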
