Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals

The ability to sequence unordered events is an essential skill for comprehending and reasoning about real-world task procedures, and it often requires a thorough understanding of temporal common sense and multimodal information, since these procedures are frequently communicated through a combination of text and images. Such a capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans can reason about and sequence unordered multimodal procedural instructions, whether current machine learning models possess this essential capability remains an open question. In this work, we benchmark models' ability to reason over and sequence unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find that models not only perform significantly worse than humans but also appear unable to use the multimodal information efficiently. To improve machine performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both text and images, yielding significant improvements of over 5%.
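
The abstract does not spell out how the sequencing task is formulated. A minimal PyTorch sketch of one plausible formulation, framing sequencing as position prediction over shuffled steps, is given below; the StepOrderPredictor module, its hyperparameters, and the fused text+image step embeddings are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: given K shuffled instruction-step embeddings,
# predict each step's original position. All names are illustrative.
import torch
import torch.nn as nn

class StepOrderPredictor(nn.Module):
    def __init__(self, dim=768, num_steps=5, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Classify each step into one of `num_steps` original positions.
        self.position_head = nn.Linear(dim, num_steps)

    def forward(self, step_embeddings):
        # step_embeddings: (batch, num_steps, dim), one fused text+image
        # embedding per instruction step, presented in shuffled order.
        contextualized = self.encoder(step_embeddings)
        return self.position_head(contextualized)  # (batch, num_steps, num_steps)

# Toy training step on random data, standing in for encoded manual steps.
batch, num_steps, dim = 4, 5, 768
model = StepOrderPredictor(dim=dim, num_steps=num_steps)
steps = torch.randn(batch, num_steps, dim)
# Ground-truth original position of each shuffled step.
targets = torch.stack([torch.randperm(num_steps) for _ in range(batch)])
logits = model(steps)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_steps), targets.reshape(-1))
loss.backward()
```

Under this reading, the sequentiality-aware pretraining the abstract describes would amount to optimizing such an order-prediction objective over shuffled text and image steps before fine-tuning on downstream sequencing benchmarks.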
