Predicting Humorousness and Metaphor Novelty with Gaussian Process Preference Learning

The inability to quantify key aspects of creative language is a frequent obstacle to natural language understanding. To address this, we introduce novel tasks for evaluating the creativeness of language—namely, scoring and ranking text by humorousness and metaphor novelty. To sidestep the difficulty of assigning discrete labels or numeric scores, we learn from pairwise comparisons between texts. We introduce a Bayesian approach for predicting humorousness and metaphor novelty using Gaussian process preference learning (GPPL), which achieves a Spearman’s ρ of 0.56 against gold using word embeddings and linguistic features. Our experiments show that given sparse, crowdsourced annotation data, ranking using GPPL outperforms best–worst scaling. We release a new dataset for evaluating humour containing 28,210 pairwise comparisons of 4,030 texts, and make our software freely available.
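To illustrate the two quantities the abstract relies on, the following is a minimal sketch, not the released software. It assumes the probit pairwise likelihood commonly used in Gaussian process preference learning, where each text has a latent score and the probability that one text is preferred over another depends on the difference of their scores. It also computes Spearman's ρ for tie-free score lists. The function names and the `noise_scale` parameter are hypothetical and chosen only for this example.

```python
import math


def preference_probability(f_a, f_b, noise_scale=1.0):
    """Probability that text a is preferred over text b.

    Assumed probit form: Phi((f_a - f_b) / (sqrt(2) * noise_scale)),
    where f_a and f_b are latent scores (e.g. humorousness).
    """
    z = (f_a - f_b) / (math.sqrt(2.0) * noise_scale)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def _ranks(values):
    """Return 0-based ranks of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def spearman_rho(predicted, gold):
    """Spearman's rank correlation for two equal-length, tie-free lists."""
    n = len(predicted)
    if n != len(gold) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    rank_pred, rank_gold = _ranks(predicted), _ranks(gold)
    d_squared = sum((p - g) ** 2 for p, g in zip(rank_pred, rank_gold))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Equal latent scores give an even chance of either preference.
    print(preference_probability(0.0, 0.0))
    # Illustrative (made-up) scores, used only to exercise the function.
    print(spearman_rho([0.1, 0.4, 0.3, 0.9], [1.0, 3.0, 2.0, 4.0]))
```

In a full preference-learning model the latent scores would be inferred from the observed comparisons rather than supplied by hand; the sketch only shows how a pair of scores maps to a preference probability and how predicted scores are compared against gold rankings.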
