3D Semantic Trajectory Reconstruction from 3D Pixel Continuum

This paper presents a method to assign a semantic label to a 3D reconstructed trajectory from multiview image streams. The key challenge of the semantic labeling lies in the self-occlusion and photometric inconsistency caused by object and social interactions, resulting in highly fragmented trajectory reconstruction with noisy semantic labels. We address this challenge by introducing a new representation called 3D semantic map-a probability distribution over labels per 3D trajectory constructed by a set of semantic recognition across multiple views. Our conjecture is that among many views, there exist a set of views that are more informative than the others. We build the 3D semantic map based on a likelihood of visibility and 2D recognition confidence and identify the view that best represents the semantics of the trajectory. We use this 3D semantic map and trajectory affinity computed by local rigid transformation to precisely infer labels as a whole. This global inference quantitatively outperforms the baseline approaches in terms of predictive validity, representation robustness, and affinity effectiveness. We demonstrate that our algorithm can robustly compute the semantic labels of a large scale trajectory set (e.g., millions of trajectories) involving real-world human interactions with object, scenes, and people.

[1]  Alessio Del Bue,et al.  Segmentation of Rigid Motion from Non-rigid 2D Trajectories , 2007, IbPRIA.

[2]  Henning Biermann,et al.  Recovering non-rigid 3D shape from image streams , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[3]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Amnon Shashua,et al.  Trajectory Triangulation: 3D Reconstruction of Moving Points from a Monocular Image Sequence , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Lorenzo Torresani,et al.  Space-Time Tracking , 2002, ECCV.

[6]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[7]  R. Vidal,et al.  Sparse Subspace Clustering: Algorithm, Theory, and Applications. , 2013, IEEE transactions on pattern analysis and machine intelligence.

[8]  René Vidal,et al.  Motion Segmentation in the Presence of Outlying, Incomplete, or Corrupted Trajectories , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Luc Van Gool,et al.  Reconstructing 3D trajectories of independently moving objects using generic constraints , 2004, Comput. Vis. Image Underst..

[10]  Vladlen Koltun,et al.  Feature Space Optimization for Semantic Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Carlo Tomasi,et al.  Video Motion for Every Visible Point , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[13]  Ian D. Reid,et al.  RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[15]  Takeo Kanade,et al.  Nonrigid Structure from Motion in Trajectory Space , 2008, NIPS.

[16]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[17]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[18]  Kiriakos N. Kutulakos,et al.  Non-rigid structure from locally-rigid motion , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[19]  Lorenzo Torresani,et al.  Tracking and modeling non-rigid objects with rank constraints , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[20]  Yasuyuki Matsushita,et al.  GMS: Grid-Based Motion Statistics for Fast, Ultra-robust Feature Correspondence , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Richard Szeliski,et al.  Modeling the World from Internet Photo Collections , 2008, International Journal of Computer Vision.

[22]  Olga Veksler,et al.  Fast Approximate Energy Minimization via Graph Cuts , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[23]  Yaser Sheikh,et al.  3D reconstruction of a smooth articulated trajectory from a monocular image sequence , 2011, 2011 International Conference on Computer Vision.

[24]  Yaser Sheikh,et al.  3D Reconstruction of a Moving Point from a Series of 2D Projections , 2010, ECCV.

[25]  Anton Osokin,et al.  Fast Approximate Energy Minimization with Label Costs , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[26]  Marc Pollefeys,et al.  Automatic Kinematic Chain Building from Feature Trajectories of Articulated Objects , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[27]  Brian Taylor,et al.  Semantic Video Segmentation from Occlusion Relations within a Convex Optimization Framework , 2013, EMMCVPR.

[28]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Motion Capture , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Pascal Fua,et al.  Surface Deformation Models for Nonrigid 3D Shape Recovery , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Rui Yu,et al.  Video Pop-up: Monocular 3D Reconstruction of Dynamic Scenes , 2014, ECCV.

[31]  Takeo Kanade,et al.  Background Subtraction for Freely Moving Cameras , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[32]  Pascal Fua,et al.  Simultaneous point matching and 3D deformable surface reconstruction , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[33]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[34]  Yaser Sheikh,et al.  MAP Visibility Estimation for Large-Scale Dynamic 3D Reconstruction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Yaser Sheikh,et al.  Bilinear spatiotemporal basis models , 2012, TOGS.

[36]  Mina Teicher,et al.  A General Framework for Trajectory Triangulation , 2004, Journal of Mathematical Imaging and Vision.

[37]  Katerina Fragkiadaki,et al.  Video segmentation by tracing discontinuities in a trajectory embedding , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  N. Ambady,et al.  Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. , 1992 .

[39]  Hongdong Li,et al.  A simple prior-free method for non-rigid structure-from-motion factorization , 2012, CVPR.

[40]  Yaser Sheikh,et al.  In defense of orthonormality constraints for nonrigid structure from motion , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Jitendra Malik,et al.  Grouping-Based Low-Rank Trajectory Completion and 3D Reconstruction , 2014, NIPS.

[42]  James M. Rehg,et al.  Joint Semantic Segmentation and 3D Reconstruction from Monocular Video , 2014, ECCV.