Self-Supervised Moving Vehicle Tracking With Stereo Sound

Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that leverages unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, using stereo sound alone at inference time. Since manually annotating correspondences between audio and object bounding boxes is labor-intensive, we instead use the co-occurrence of the visual and audio streams in unlabeled videos as a form of self-supervision, avoiding the collection of ground-truth annotations. In particular, we propose a framework that consists of a vision ``teacher'' network and a stereo-sound ``student'' network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain, using unlabeled videos as a bridge. At test time, the stereo-sound student network works independently, localizing objects from just stereo audio and camera metadata, without any visual input. Experimental results on a newly collected Auditory Vehicles Tracking dataset verify that our proposed approach outperforms several baselines. We also demonstrate that our cross-modal auditory localization approach can assist the visual localization of moving vehicles under poor lighting conditions.
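
To make the teacher-student transfer concrete, the following is a minimal sketch in PyTorch. The class and function names (`StereoSoundStudent`, `distillation_step`), the single-box output format, and the specific architecture are illustrative assumptions, not the authors' released code; the frozen teacher stands in for any pretrained visual vehicle detector whose per-frame detections serve as pseudo-labels for the audio student.

```python
# Hypothetical sketch of cross-modal teacher-student distillation:
# a frozen visual detector supervises a stereo-sound network on
# unlabeled video, so no human box annotations are required.
import torch
import torch.nn as nn

class StereoSoundStudent(nn.Module):
    """Maps a 2-channel (stereo) log-spectrogram to box coordinates."""
    def __init__(self, out_dim: int = 4):  # (x, y, w, h) per frame
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, out_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 2, freq_bins, time_frames)
        feats = self.encoder(spec).flatten(1)
        return self.head(feats)

def distillation_step(teacher, student, frames, spectrograms, optimizer):
    """One training step: the frozen visual teacher produces pseudo-label
    boxes from video frames; the student must predict the same boxes
    from the time-aligned stereo audio alone."""
    with torch.no_grad():
        pseudo_boxes = teacher(frames)      # (batch, 4) teacher detections
    pred_boxes = student(spectrograms)      # (batch, 4) from audio alone
    loss = nn.functional.smooth_l1_loss(pred_boxes, pseudo_boxes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference, only the student runs: `student(spectrogram)` yields box coordinates in the image plane directly, so no visual input is needed beyond the camera metadata that fixes the visual reference frame, consistent with the test-time setting described above.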
