Crowdsourcing Diverse Paraphrases for Training Task-oriented Bots

A prominent approach to building datasets for training task-oriented bots is crowd-based paraphrasing. Current approaches, however, assume the crowd will naturally provide diverse paraphrases or focus only on lexical diversity. In this WiP we address an overlooked aspect of diversity, introducing an approach for guiding the crowdsourcing process towards paraphrases that are syntactically diverse.

Background & Motivation

Task-oriented chatbots (or simply bots) enable users to interact with software-enabled services in natural language. Such interactions require bots to process utterances (i.e., user input) like “find restaurants in Milan” to identify the user’s intent. A prominent approach to building datasets for intent recognition models involves acquiring an initial set of seed utterances (for the intents) and then growing it by paraphrasing this set via crowdsourcing (Yaghoub-Zadeh-Fard et al. 2020b). An important dimension of quality in this context is diversity, i.e., the breadth and variety of paraphrases in the resulting corpus, which dictates the ability to capture the many ways users may express an intent. Paraphrasing techniques generally rely on approaches that aim at introducing lexical and syntactic variations (Thompson and Post 2020). Lexical variations refer to changes that affect individual words, such as substituting words with their synonyms (e.g., “search restaurants in Milan”). Syntactic variations, instead, refer to changes in sentence or phrasal structure, such as transforming the grammatical structure of a sentence (e.g., “Where can we eat in Milan?”). While the development of techniques to introduce such lexical and syntactic variations is the focus of ongoing work in automatic paraphrasing (Berro et al. 2021), they remain greatly under-explored in the crowdsourcing community.
Among the few contributions towards diversity, a prominent data collection framework turns crowd-based paraphrasing into an iterative, multi-stage pipeline. Here, multiple rounds of paraphrasing are chained together, and the seed utterances for a round come from a previous round through different seed selection strategies (e.g., simply choosing all paraphrases from the previous round (Negri et al. 2012), random sampling (Jiang, Kummerfeld, and Lasecki 2017), or identifying outliers (Larson et al. 2019)). These strategies ultimately aim to reduce the biasing effect of factors like the seed utterances and the examples shown to workers (Wang et al. 2012). Diversity can be further improved by focusing on the actual crowdsourcing task. This task could constrain the crowd from using frequently-used words (Larson et al. 2020) or suggest words that workers may incorporate in their paraphrases (Yaghoub-Zadeh-Fard et al. 2020a). While valuable, these contributions assume workers would naturally produce diverse paraphrases or focus primarily on lexical variations. In this paper we describe our preliminary work towards a multi-stage paraphrasing pipeline that can guide the crowdsourcing process towards producing paraphrases that are syntactically diverse and balanced.

Crowdsourcing Diverse Paraphrases

Figure 1 depicts our approach and where it sits in an iterative, multi-stage pipeline for crowd-based paraphrasing based on prior art (Negri et al. 2012; Kang et al. 2018; Larson et al. 2019). In this pipeline, a typical round r of data collection (black arrows) takes as input a dataset of seed utterances X and a curated collection of paraphrases Y (initially, Y can be empty). The crowdsourcing task in the paraphrase generation step asks a worker to provide a set of n paraphrases yj for an utterance x. The resulting collection of unverified paraphrases Ȳ is fed to the paraphrase validation step, where another crowd helps to check for correctness.
The correct paraphrases are then appended to the collection of curated paraphrases Y. The seed selection step updates (or fully replaces) the seeds in X by sampling from the correct paraphrases to create the set of seeds for the next round. Our approach assumes an initial (X, Y) as input and aims to steer the crowd towards specific patterns or encourage workers to contribute novel syntactic variations to the input dataset. For these goals, we introduce a pattern selection step and propose novel prompts for paraphrase generation.

Pattern selection. To capture and control syntax, we follow (Iyyer et al. 2018) and define a pattern as the top two levels of a constituency parse tree (at this depth the tree mostly contains clause- and phrase-level nodes, making syntax comparisons less strict but still effective). The pattern selection step thus analyzes the paraphrases in Y and identifies target patterns to support the paraphrase generation step towards these goals. How to identify target patterns? For example, we may choose the k least-frequent patterns in Y as targets, or the …

[Figure 1: The iterative, multi-stage pipeline. Initial seed utterances feed the paraphrase generation crowdsourcing task; the resulting unverified paraphrases go through the paraphrase validation crowdsourcing task, where incorrect paraphrases are discarded and correct ones are added to the curated paraphrases; seed selection and pattern selection prepare round ri.]
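To make the notion of a pattern concrete, the selection step can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: it assumes paraphrases have already been run through a constituency parser producing Penn Treebank-style bracketed strings (the parser itself is out of scope), and the helper names (top_two_pattern, least_frequent_patterns) are ours.

```python
import re
from collections import Counter

def parse_sexpr(s):
    """Parse a bracketed constituency string into nested lists [label, children...]."""
    tokens = re.findall(r'\(|\)|[^\s()]+', s)
    def helper(i):
        # tokens[i] must be '('; tokens[i+1] is the node label
        label = tokens[i + 1]
        node = [label]
        i += 2
        while tokens[i] != ')':
            if tokens[i] == '(':
                child, i = helper(i)
                node.append(child)
            else:
                i += 1  # leaf token (a word); irrelevant for syntactic patterns
        return node, i + 1
    tree, _ = helper(0)
    return tree

def top_two_pattern(parse_str):
    """A pattern is the top two levels of the parse: root label plus its children's labels."""
    tree = parse_sexpr(parse_str)
    children = [c[0] for c in tree[1:] if isinstance(c, list)]
    return f"({tree[0]} {' '.join(children)})"

def least_frequent_patterns(parse_strs, k):
    """Select the k least-frequent patterns in the curated collection as targets."""
    counts = Counter(top_two_pattern(p) for p in parse_strs)
    return [pat for pat, _ in sorted(counts.items(), key=lambda kv: kv[1])[:k]]
```

For instance, both "(S (NP (PRP I)) (VP (VBP find) (NP (NNS restaurants))))" and "(S (NP (PRP I)) (VP (VBP want) (NP (NN food))))" collapse to the pattern "(S NP VP)", so a question-form parse would be flagged as a rare (target) pattern relative to them.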

References

[1] Boualem Benatallah, et al. A Study of Incorrect Paraphrases in Crowdsourced User Utterances. NAACL, 2019.

[2] Luke S. Zettlemoyer, et al. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. NAACL, 2018.

[3] Eric Horvitz, et al. Crowdsourcing the Acquisition of Natural Language Corpora: Methods and Observations. 2012 IEEE Spoken Language Technology Workshop (SLT), 2012.

[4] Fabio Casati, et al. User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities. IEEE Internet Computing, 2020.

[5] Walter S. Lasecki, et al. Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection. ACL, 2017.

[6] Stefan Larson, et al. Iterative Feature Mining for Constraint-Based Data Collection to Increase Data Diversity and Model Robustness. EMNLP, 2020.

[7] Matt Post, et al. Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity. WMT, 2020.

[8] Lingjia Tang, et al. Outlier Detection for Improved Data Quality and Diversity in Dialog Systems. NAACL, 2019.

[9] Boualem Benatallah, et al. An Extensible and Reusable Pipeline for Automated Utterance Paraphrases. Proc. VLDB Endow., 2021.

[10] Matteo Negri, et al. Chinese Whispers: Cooperative Paraphrase Acquisition. LREC, 2012.

[11] Lingjia Tang, et al. Data Collection for Dialogue System: A Startup Perspective. NAACL-HLT, 2018.

[12] Fabio Casati, et al. Dynamic Word Recommendation to Obtain Diverse Crowdsourced Paraphrases of User Utterances. IUI, 2020.

[13] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011.

[14] Phoebe Liu, et al. Optimizing the Design and Cost for Crowdsourced Conversational Utterances. 2019.