Out of Sight But Not Out of Mind: An Answer Set Programming Based Online Abduction Framework for Visual Sensemaking in Autonomous Driving

We demonstrate the need for, and potential of, systematically integrated vision and semantics solutions for visual sensemaking in the backdrop of autonomous driving. A general method for online visual sensemaking using answer set programming is systematically formalised and fully implemented. The method integrates the state of the art in (deep learning based) visual computing, and is developed as a modular framework usable within hybrid architectures for perception and control. We evaluate and demonstrate the method with the community-established benchmarks KITTIMOD and MOT. As a use case, we focus on the significance of human-centred visual sensemaking (e.g., semantic representation and explainability, question-answering, commonsense interpolation) in safety-critical autonomous driving situations.
