A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. This issue is particularly challenging for understanding causal and correlational relationships between events. While this topic has received a lot of interest in the NLP community, research has been hindered by the lack of a proper evaluation framework. This paper attempts to address this problem with a new framework for evaluating story understanding and script learning: the 'Story Cloze Test'. This test requires a system to choose the correct ending to a four-sentence story. We created a new corpus of 50k five-sentence commonsense stories, ROCStories, to enable this evaluation. This corpus is unique in two ways: (1) it captures a rich set of causal and temporal commonsense relations between daily events, and (2) it is a high-quality collection of everyday life stories that can also be used for story generation. Experimental evaluation shows that a host of baselines and state-of-the-art models based on shallow language understanding struggle to achieve a high score on the Story Cloze Test. We discuss the implications of these results for script and story learning, and offer suggestions for deeper language understanding.
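To make the evaluation protocol concrete, the following is a minimal sketch of how a Story Cloze Test score could be computed: each item pairs a four-sentence context with candidate endings, and a system is scored by how often it picks the correct one. The names `ClozeItem`, `word_overlap_chooser`, and `accuracy` are hypothetical, the example story is invented for illustration and is not taken from ROCStories, and the word-overlap chooser is only a stand-in for the kind of shallow baseline the abstract mentions, not one of the paper's models.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ClozeItem:
    """One test item (hypothetical layout, not the released file format)."""
    context: Sequence[str]   # the four-sentence story
    endings: Sequence[str]   # candidate final sentences
    correct: int             # index of the correct ending


# A system is any function that maps (context, endings) to a chosen index.
Chooser = Callable[[Sequence[str], Sequence[str]], int]


def _tokens(sentence: str) -> list[str]:
    """Lowercase whitespace tokens with trailing punctuation stripped."""
    return [w.lower().strip(".,!?") for w in sentence.split()]


def word_overlap_chooser(context: Sequence[str], endings: Sequence[str]) -> int:
    """Shallow stand-in baseline: pick the ending sharing most words with the context."""
    context_words = {w for s in context for w in _tokens(s)}

    def overlap(ending: str) -> int:
        return sum(w in context_words for w in _tokens(ending))

    return max(range(len(endings)), key=lambda i: overlap(endings[i]))


def accuracy(items: Sequence[ClozeItem], choose: Chooser) -> float:
    """Fraction of items for which the system picks the correct ending."""
    if not items:
        raise ValueError("need at least one item")
    hits = sum(choose(it.context, it.endings) == it.correct for it in items)
    return hits / len(items)


# Invented illustrative item (an assumption, not corpus data).
example = ClozeItem(
    context=(
        "Sam was hungry.",
        "He went to the kitchen.",
        "He found bread and cheese.",
        "He made a sandwich.",
    ),
    endings=("Sam ate the sandwich.", "The car would not start."),
    correct=0,
)

if __name__ == "__main__":
    print(accuracy([example], word_overlap_chooser))
```

Any model can be plugged in through the `Chooser` signature, so the same `accuracy` function would score a shallow baseline and a deeper model on equal terms.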
