Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer

Neuromorphic spike data, an emerging modality with high temporal resolution, has shown promising potential in real-world applications due to its inherent advantage in overcoming high-velocity motion blur. However, training a spike depth estimation network poses significant challenges in two respects: spike streams provide only sparse spatial information for a dense regression task, and paired depth labels are difficult to obtain for temporally intensive spike streams. In this paper, we therefore propose a cross-modality cross-domain (BiCross) framework to realize unsupervised spike depth estimation with the help of open-source RGB data. It first transfers cross-modality knowledge from source RGB data to simulated source spike data, which serves as an intermediary, and then realizes cross-domain learning from simulated source spike data to target spike data. Specifically, Coarse-to-Fine Knowledge Distillation (CFKD) is introduced to transfer cross-modality knowledge at the global and pixel levels in the source domain, which complements sparse spike features with the rich semantic knowledge of image features. We then propose an Uncertainty Guided Teacher-Student (UGTS) method to realize cross-domain learning on the spike target domain, ensuring domain-invariant global and pixel-level knowledge in the teacher and student models through alignment and uncertainty-guided depth selection. To verify the effectiveness of BiCross, we conduct extensive experiments on three scenarios: Synthetic to Real, Extreme Weather, and Scene Changing. The code and datasets will be released.
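To make the two stages more concrete, the following is a minimal NumPy sketch of the kinds of quantities involved, not the released BiCross implementation. Everything in it is an assumption made for illustration: the abstract does not specify the loss forms, so the global term is taken as a mean-squared error on globally pooled features, the pixel-level term as a per-element mean-squared error, the uncertainty as the variance over several teacher predictions, and the teacher update as an exponential moving average. All function names (`global_distill_loss`, `pixel_distill_loss`, `uncertainty_mask`, `masked_l1`, `ema_update`) are hypothetical.

```python
import numpy as np


def global_distill_loss(f_rgb, f_spike):
    """Coarse term (assumed form): MSE between globally pooled (C, H, W) features."""
    g_rgb = f_rgb.mean(axis=(1, 2))      # (C,) global descriptor of the RGB branch
    g_spk = f_spike.mean(axis=(1, 2))    # (C,) global descriptor of the spike branch
    return float(np.mean((g_rgb - g_spk) ** 2))


def pixel_distill_loss(f_rgb, f_spike):
    """Fine term (assumed form): per-element MSE between feature maps."""
    return float(np.mean((f_rgb - f_spike) ** 2))


def uncertainty_mask(teacher_depths, quantile=0.5):
    """Select low-uncertainty pixels from K teacher predictions of shape (K, H, W).

    Uncertainty is assumed to be the per-pixel variance across the K predictions;
    pixels whose variance is at or below the given quantile are kept.
    Returns the mean depth (used as pseudo label) and a boolean mask.
    """
    pseudo_depth = teacher_depths.mean(axis=0)
    variance = teacher_depths.var(axis=0)
    threshold = np.quantile(variance, quantile)
    return pseudo_depth, variance <= threshold


def masked_l1(student_depth, pseudo_depth, mask):
    """L1 loss between student depth and pseudo labels on the selected pixels only."""
    if not mask.any():
        return 0.0
    return float(np.abs(student_depth - pseudo_depth)[mask].mean())


def ema_update(teacher, student, momentum=0.99):
    """Assumed teacher update: exponential moving average of student weights."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_rgb = rng.normal(size=(8, 4, 4))       # stand-in for RGB features
    f_spike = rng.normal(size=(8, 4, 4))     # stand-in for simulated spike features
    print("global:", global_distill_loss(f_rgb, f_spike))
    print("pixel :", pixel_distill_loss(f_rgb, f_spike))

    teacher_preds = rng.uniform(1.0, 80.0, size=(4, 4, 4))   # K=4 teacher passes
    pseudo, mask = uncertainty_mask(teacher_preds)
    student_pred = rng.uniform(1.0, 80.0, size=(4, 4))
    print("target loss:", masked_l1(student_pred, pseudo, mask))
```

In this sketch the two distillation terms would be applied in the source domain between the RGB and simulated spike branches, while the masked loss and the moving-average update would be applied in the target spike domain; how the actual method weights or schedules these terms is not stated in the abstract.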
