Paper information - Structure-Aware Procedural Text Generation From an Image Sequence

Creating new value by combining materials is an important activity in our society. From daily cooking to industrial manufacturing, we often describe how to do this in a procedural text. As pointed out by previous studies on natural language understanding, one important property of procedural text is its context dependency, which stems from the merging operations of materials and can be represented by a graph or tree structure. This paper investigates the impact of explicitly introducing such a structure into the vision-and-language task of procedural text generation from an image sequence. To this end, we propose (1) a new dataset, which extends the definition of a tree structure, the merging tree, to a vision-and-language version, and (2) a novel structure-aware procedural text generation model, which learns the context dependency efficiently. Experimental results show that the proposed method can boost the performance of traditional versatile methods.
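
To make the idea of merging operations forming a tree more concrete, the following is a minimal illustrative sketch in Python. It is not the paper's dataset format or model: the class name `MergeNode`, the helper `merge`, and the pancake recipe are all assumptions introduced here for illustration. Leaves stand for raw materials, and each internal node stands for a step that merges its children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MergeNode:
    """Hypothetical tree node: a raw material (leaf) or the result of a merging step."""
    label: str
    children: List["MergeNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A node with no children is a raw material.
        return not self.children

    def materials(self) -> List[str]:
        """Collect the raw materials (leaves) that feed into this node, left to right."""
        if self.is_leaf():
            return [self.label]
        collected: List[str] = []
        for child in self.children:
            collected.extend(child.materials())
        return collected


def merge(label: str, *parts: MergeNode) -> MergeNode:
    """Create a node for a step that combines the given parts (illustrative helper)."""
    return MergeNode(label, list(parts))


# Illustrative recipe (an assumption, not taken from the paper's data).
flour = MergeNode("flour")
egg = MergeNode("egg")
milk = MergeNode("milk")
butter = MergeNode("butter")

batter = merge("step 1: mix batter", flour, egg, milk)
pancake = merge("step 2: fry in butter", batter, butter)

# The second step depends on the output of the first one.
print(pancake.materials())
```

In this sketch, the context a later step depends on can be read off the tree: the subtree under `pancake` contains the earlier step `batter` and everything merged into it.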
