Evaluation of Argument Search Approaches in the Context of Argumentative Dialogue Systems

We present an approach to evaluating argument search techniques with regard to their use in argumentative dialogue systems by assessing quality aspects of the retrieved arguments. To this end, we introduce a dialogue system that presents arguments to users through a virtual avatar and synthetic speech, and allows them to rate the presented content in four categories (Interesting, Convincing, Comprehensible, Relation). We apply the approach in a user study that compares two state-of-the-art argument search engines with each other and with a baseline system built on traditional web search. The results show a significant advantage of the two search engines over the baseline. Moreover, each search engine significantly outperforms the other in different categories, reflecting the strengths and weaknesses of the underlying techniques.