RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes

Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises approximately 20K instructional recipes with multiple modalities such as titles, descriptions, and aligned sets of images. With over 36K automatically generated question-answer pairs, we design a set of comprehension and reasoning tasks that require joint understanding of images and text, capturing the temporal flow of events, and making sense of procedural knowledge. Our preliminary results indicate that RecipeQA will serve as a challenging test bed and an ideal benchmark for evaluating machine comprehension systems. The data and leaderboard are available at http://hucvl.github.io/recipeqa.
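For concreteness, the sketch below shows how a multimodal sample of this kind might be traversed: a recipe context of aligned text and image steps, followed by a generated question-answer pair. The file name and all field names (`context`, `question`, `choice_list`, `answer`) are assumptions made for illustration, not the dataset's documented schema.

```python
import json

# Hypothetical file and schema: each entry pairs a multimodal recipe
# context (textual steps plus their aligned images) with one
# automatically generated question-answer pair.
with open("recipeqa_train.json") as f:
    samples = json.load(f)  # assumed to be a list of QA samples

for sample in samples[:3]:
    # Each recipe step is assumed to carry a title, a body, and the
    # list of images aligned to that step.
    for step in sample["context"]:
        print(step["title"], "-", len(step.get("images", [])), "aligned images")
    print("Q:", sample["question"])
    print("Choices:", sample["choice_list"])
    print("Gold answer index:", sample["answer"])
```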
