Point cloud video object segmentation using a persistent supervoxel world-model

Robust visual tracking is an essential precursor to understanding and replicating human actions in robotic systems. To accurately evaluate the semantic meaning of a sequence of video frames, or to replicate an action contained therein, one must be able to coherently track and segment all observed agents and objects. This work proposes a novel online, point-cloud-based algorithm that simultaneously tracks the 6-DoF pose and determines the spatial extent of all entities in indoor scenarios. This is accomplished using a persistent supervoxel world-model which is updated, rather than replaced, as new frames of data arrive. Maintaining a world-model enables general object permanence, permitting successful tracking through full occlusions. Object models are tracked by a bank of independent adaptive particle filters, which use a supervoxel observation model to give rough estimates of object state. These estimates are united by a novel multi-model, RANSAC-like approach that minimizes a global energy function associating world-model supervoxels with predicted states. We present results on a standard robotic assembly benchmark for two application scenarios, human trajectory imitation and semantic action understanding, demonstrating the usefulness of the tracking in intelligent robotic systems.
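The core loop described above, per-object particle-filter prediction followed by an energy-minimizing association of world-model supervoxels to the predicted states, can be sketched as follows. This is only an illustrative simplification under assumed names (`predict_states`, `associate_supervoxels`, `outlier_cost`): the actual method uses KLD-sampling adaptive particle filters over 6-DoF poses and a global multi-model energy, whereas this sketch uses 3-D centroids, a constant-position motion model, and a greedy per-supervoxel data cost with an outlier label.

```python
import math
import random

def predict_states(filters):
    """Propagate each per-object particle filter one step and return a
    rough state estimate (here: the mean 3-D centroid) per object.
    A real implementation would sample 6-DoF poses and reweight the
    particles against the supervoxel observation model."""
    estimates = []
    for particles in filters:
        # Constant-position motion model with Gaussian process noise.
        moved = [tuple(c + random.gauss(0.0, 0.01) for c in p) for p in particles]
        n = len(moved)
        estimates.append(tuple(sum(p[i] for p in moved) / n for i in range(3)))
    return estimates

def associate_supervoxels(supervoxels, estimates, outlier_cost=1.0):
    """Assign each supervoxel centroid to the predicted object state with
    the lowest data cost (Euclidean distance), or to the outlier label -1
    when every model costs more than outlier_cost. The paper's RANSAC-like
    approach instead minimizes a single global energy over all labels."""
    labels = []
    for sv in supervoxels:
        costs = [math.dist(sv, e) for e in estimates]
        best = min(range(len(costs)), key=costs.__getitem__)
        labels.append(best if costs[best] < outlier_cost else -1)
    return labels
```

With two objects near the origin and near (5, 5, 5), a nearby supervoxel is claimed by each tracked model, while a distant one falls to the outlier label, which is how an occluded or newly revealed region would remain unassigned until the world-model is updated.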
