PACS: A Dataset for Physical Audiovisual CommonSense Reasoning

Abstract. In order for AI to be safely deployed in real-world scenarios such as hospitals, schools, and the workplace, it must be able to robustly reason about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities, two of them being vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance the research field of physical reasoning by bringing audio in as a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and providing possible avenues for future research.
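To make the benchmark setup concrete, below is a minimal PyTorch sketch of a late-fusion baseline for PACS-style audiovisual question answering. It assumes the common two-candidate multiple-choice format (a question paired with two objects, each represented by video and audio), and the embedding dimensions, module names, and scoring scheme are illustrative assumptions of ours, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class LateFusionBaseline(nn.Module):
    """Hypothetical late-fusion scorer for PACS-style binary QA.

    Assumption: each example pairs one question embedding with two
    candidate objects, each described by a video embedding and an
    audio embedding; the model picks the object that better
    satisfies the question. Dimensions are placeholders.
    """

    def __init__(self, q_dim=768, v_dim=512, a_dim=512, hidden=256):
        super().__init__()
        fused = q_dim + v_dim + a_dim
        self.scorer = nn.Sequential(
            nn.Linear(fused, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one compatibility score per object
        )

    def forward(self, q, v1, a1, v2, a2):
        # Score each (question, object) pair independently, then compare.
        s1 = self.scorer(torch.cat([q, v1, a1], dim=-1))
        s2 = self.scorer(torch.cat([q, v2, a2], dim=-1))
        return torch.cat([s1, s2], dim=-1)  # logits over the two objects

# Toy usage: random tensors stand in for pretrained text/video/audio
# features (e.g., from encoders like those cited in the paper).
model = LateFusionBaseline()
q = torch.randn(4, 768)                             # question embeddings
v1, v2 = torch.randn(4, 512), torch.randn(4, 512)   # video embeddings
a1, a2 = torch.randn(4, 512), torch.randn(4, 512)   # audio embeddings
logits = model(q, v1, a1, v2, a2)
pred = logits.argmax(dim=-1)  # 0 or 1: which object answers the question
```

Late fusion of this kind treats each modality's encoder as a fixed feature extractor and only learns the joint scorer, which makes it a natural first baseline for probing whether audio contributes beyond vision and language alone.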
