Multimodal Deep Feature Fusion (MMDFF) for RGB-D Tracking

Visual tracking is still a challenging task due to occlusion, appearance changes, complex motion, etc. We propose a novel RGB-D tracker based on multimodal deep feature fusion (MMDFF) in this paper. MMDFF model consists of four deep Convolutional Neural Networks (CNNs): Motion-specific CNN, RGB- specific CNN, Depth-specific CNN, and RGB-Depth correlated CNN. The depth image is encoded into three channels which are sent into depth-specific CNN to extract deep depth features. The optical flow image is calculated for every frame and then is fed to motion-specific CNN to learn deep motion features. Deep RGB, depth, and motion information can be effectively fused at multiple layers via MMDFF model. Finally, multimodal fusion deep features are sent into the C-COT tracker to obtain the tracking result. For evaluation, experiments are conducted on two recent large-scale RGB-D datasets and results demonstrate that our proposed RGB-D tracking method achieves better performance than other state-of-art RGB-D trackers.

[1]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Cordelia Schmid,et al.  P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Antonis A. Argyros,et al.  Markerless 3D Human Pose Estimation and Tracking based on RGBD Cameras: an Experimental Evaluation , 2017, PETRA.

[4]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[5]  Ming-Hsuan Yang,et al.  Hierarchical Convolutional Features for Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Wolfram Burgard,et al.  Multimodal deep learning for robust RGB-D object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[7]  Kai Oliver Arras,et al.  Tracking people in 3D using a bottom-up top-down detector , 2011, 2011 IEEE International Conference on Robotics and Automation.

[8]  Guan Huang,et al.  UCT: Learning Unified Convolutional Networks for Real-Time Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[9]  Xin Zhao,et al.  Locality-Sensitive Deconvolution Networks with Gated Fusion for RGB-D Indoor Semantic Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Shin Ishii,et al.  An occlusion-aware particle filter tracker to handle complex and persistent occlusions , 2016, Computer Vision and Image Understanding.

[11]  Kai Oliver Arras,et al.  People tracking in RGB-D data with on-line boosted target models , 2011, 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[12]  Kaiqi Huang,et al.  Semi-supervised learning and feature evaluation for RGB-D object recognition , 2015, Comput. Vis. Image Underst..

[13]  Majid Mirmehdi,et al.  Real-time RGB-D Tracking with Depth Scaling Kernelised Correlation Filters and Occlusion Handling , 2015, BMVC.

[14]  Stefan Duffner,et al.  Using Discriminative Motion Context for Online Visual Object Tracking , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[15]  Wenbing Tao,et al.  Visual object tracking via enhanced structural correlation filter , 2017, Inf. Sci..

[16]  Tianshuang Qiu,et al.  Multi-person detecting and tracking based on RGB-D sensor for a robot vision system , 2017, Int. J. Embed. Syst..

[17]  Trevor Darrell,et al.  Cross-modal adaptation for RGB-D detection , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[18]  Zhenzhou Tang,et al.  Visual Object Tracking Based on Cross-Modality Gaussian-Bernoulli Deep Boltzmann Machines with RGB-D Sensors , 2017, Sensors.

[19]  Yuqing Gao,et al.  Robust Fusion of Color and Depth Data for RGB-D Target Tracking Using Adaptive Range-Invariant Depth Models and Spatio-Temporal Consistency Constraints , 2018, IEEE Transactions on Cybernetics.

[20]  Jiwen Lu,et al.  Modality and Component Aware Feature Fusion for RGB-D Scene Classification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Seungyong Lee,et al.  RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Bao-Liang Lu,et al.  Online Depth Image-Based Object Tracking with Sparse Representation and Object Detection , 2016, Neural Processing Letters.

[23]  Feng Wu,et al.  Object detection via deeply exploiting depth information , 2018, Neurocomputing.

[24]  Zhenzhou Tang,et al.  Tracking multiple targets based on min-cost network flows with detection in RGB-D data , 2017, Int. J. Comput. Sci. Eng..

[25]  Ning An,et al.  Online RGB-D tracking via detection-learning-segmentation , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[26]  Jiebo Luo,et al.  Multi-modal deep feature learning for RGB-D object detection , 2017, Pattern Recognit..

[27]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[28]  Rynson W. H. Lau,et al.  CREST: Convolutional Residual Learning for Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[30]  Michael Felsberg,et al.  Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking , 2016, ECCV.

[31]  Yiannis Andreopoulos,et al.  Video Classification With CNNs: Using the Codec as a Spatio-Temporal Activity Sensor , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[32]  K. Jandeleit-Dahm,et al.  Diabetes Alters Activation and Repression of Pro- and Anti-Inflammatory Signaling Pathways in the Vasculature , 2013, Front. Endocrinol..

[33]  Guisong Xia,et al.  Visual object tracking by correlation filters and online learning , 2017, ISPRS Journal of Photogrammetry and Remote Sensing.

[34]  Wei Wu,et al.  End-to-End Flow Correlation Tracking with Spatial-Temporal Attention , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Jianxiong Xiao,et al.  Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Xiaogang Wang,et al.  Visual Tracking with Fully Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Michael Felsberg,et al.  Deep motion and appearance cues for visual tracking , 2019, Pattern Recognit. Lett..