Estimating more camera poses for ego-centric videos is essential for VQ3D

Visual Queries 3D Localization (VQ3D) is a task in the Ego4D Episodic Memory Benchmark. Given an egocentric video, the goal is to answer queries of the form "Where did I last see object X?", where the query object X is specified as a static image and the answer is a 3D displacement vector pointing to object X. However, current techniques use naive methods to estimate the camera poses of video frames, resulting in a low query-with-pose (QwP) ratio and therefore a poor overall success rate. In this work, we design a new pipeline for the challenging problem of egocentric video camera pose estimation. Moreover, we revisit the current VQ3D framework and optimize it for both performance and efficiency. As a result, we achieve the top-1 overall success rate of 25.8% on the VQ3D leaderboard, nearly three times the 8.7% reported by the baseline.
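To make the link between the QwP ratio and the overall success rate concrete, the sketch below assumes the usual VQ3D scoring convention: a query whose frame has no estimated camera pose cannot be answered and is scored as a failure, so the overall success rate is upper-bounded by the QwP ratio. The function and variable names are illustrative, not taken from the released Ego4D evaluation code.

```python
def vq3d_metrics(n_queries: int, n_posed: int, n_success: int):
    """Minimal sketch of how the QwP ratio caps the overall success rate.

    Assumption (not the official Ego4D evaluation code): a query whose
    video frame lacks an estimated camera pose counts as a failure, so
    all successes must come from posed queries.
    """
    assert 0 <= n_success <= n_posed <= n_queries
    qwp_ratio = n_posed / n_queries   # fraction of queries with a camera pose
    overall = n_success / n_queries   # successes over *all* queries
    return qwp_ratio, overall         # overall <= qwp_ratio always holds


# Illustrative numbers only: even perfect localization on posed queries
# cannot beat the QwP ratio, so estimating more camera poses raises the
# ceiling on the overall success rate.
qwp, overall = vq3d_metrics(n_queries=100, n_posed=30, n_success=30)
print(qwp, overall)  # 0.3 0.3
```

Under this convention, improving pose estimation coverage directly raises the achievable overall success rate, which is why the pipeline focuses on estimating poses for more frames rather than only on localization accuracy.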
