TIMEDIAL: Temporal Commonsense Reasoning in Dialog

Everyday conversations require understanding everyday events, which, in turn, requires understanding the temporal commonsense concepts interwoven with those events. Despite recent progress with massive pre-trained language models (LMs) such as T5 and GPT-3, their capability for temporal reasoning in dialogs remains largely under-explored. In this paper, we present the first study to investigate pre-trained LMs for their temporal reasoning capabilities in dialogs by introducing a new task and a crowd-sourced English challenge set, TIMEDIAL. We formulate TIMEDIAL as a multiple-choice cloze task with over 1.1K carefully curated dialogs. Empirical results demonstrate that even the best-performing models struggle on this task compared to humans, with a gap of 23 absolute points in accuracy. Furthermore, our analysis reveals that the models fail to reason about dialog context correctly; instead, they rely on shallow cues based on existing temporal patterns in context, motivating future research on modeling temporal concepts in text and robust contextual reasoning about them. The dataset is publicly available at: https://github.com/google-research-datasets/timedial.
