Social IQA: Commonsense Reasoning about Social Interactions

We introduce Social IQa, the first large-scale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple-choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could hear"). Through crowdsourcing, we collect commonsense questions about social interactions along with correct and incorrect answers, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on pretrained language models, which fall short of human performance by a gap of more than 20%. Notably, we further establish Social IQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on multiple commonsense reasoning tasks (Winograd Schemas, COPA).
