Joint Parsing of Cross-view Scenes with Spatio-temporal Semantic Parse Graphs

Cross-view video understanding is an important yet under-explored area in computer vision. In this paper, we introduce a joint parsing method that takes view-centric proposals from pre-trained computer vision models and produces spatio-temporal parse graphs that represent a coherent scene-centric understanding of cross-view scenes. Our key observations are that overlapping fields of view embed rich appearance and geometry correlations, and that knowledge segments corresponding to individual vision tasks are governed by consistency constraints available in commonsense knowledge. The proposed joint parsing framework models such correlations and constraints explicitly and generates semantic parse graphs of the scene. Quantitative experiments show that scene-centric predictions in the parse graph outperform view-centric predictions.
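The idea of fusing view-centric proposals into a scene-centric node of a spatio-temporal parse graph can be sketched as a small data structure. The field names, node layout, and confidence-weighted voting rule below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
from dataclasses import dataclass, field

@dataclass
class ViewProposal:
    """A view-centric prediction from a pre-trained model (assumed shape)."""
    view_id: int
    label: str
    confidence: float

@dataclass
class ParseGraphNode:
    """A scene-centric node with a temporal extent and child nodes (assumed shape)."""
    label: str
    t_start: int
    t_end: int
    children: list = field(default_factory=list)

def fuse_proposals(proposals, min_conf=0.5):
    """Pick a scene-centric label by confidence-weighted voting over views.

    This is a toy stand-in for the cross-view consistency reasoning
    described in the abstract, not the paper's inference algorithm.
    """
    scores = {}
    for p in proposals:
        if p.confidence >= min_conf:
            scores[p.label] = scores.get(p.label, 0.0) + p.confidence
    return max(scores, key=scores.get) if scores else None

# Two views agree on "person"; a weaker "car" proposal is outvoted.
proposals = [
    ViewProposal(0, "person", 0.9),
    ViewProposal(1, "person", 0.8),
    ViewProposal(2, "car", 0.6),
]
node = ParseGraphNode(fuse_proposals(proposals), t_start=0, t_end=30)
```

A real system would additionally enforce geometric consistency (e.g. that the views' detections project to the same 3D location), which this sketch omits.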
