Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges

Efficient multimedia event processing is a key enabler for real-time, complex decision making over streaming media. The explosive growth of multimedia data in smart cities and on the internet creates a need for expressive queries that detect high-level, human-understandable spatial and temporal events in multimedia streams. Recent work in stream reasoning, event processing and visual reasoning motivates integrating visual and commonsense reasoning into multimedia event processing, which would make event rules and queries more expressive. This can be achieved by carefully integrating knowledge about entities, relations and rules from rich knowledge bases, through reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising; however, there are several associated challenges, which this paper highlights.
