Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications

With the rapid growth of REST (REpresentational State Transfer) APIs (Application Programming Interfaces), many applications have been designed to harness their potential. Bots have accordingly become attractive interfaces for connecting humans to APIs. Supervised approaches for building bots rely on a large set of user utterances paired with API methods. Such pairs are typically collected by obtaining initial utterances for a given API method and paraphrasing them to obtain new variations. However, existing approaches for generating initial utterances (e.g., creating sentence templates) do not scale and are domain-specific, making bots expensive to maintain. The automatic generation of initial utterances can be framed as a supervised translation task in which an API method is translated into an utterance; the key challenge is the lack of training data for domain-independent models. In this paper, we propose API2CAN, a dataset containing 14,370 pairs of API methods and utterances, built by processing a large number of public APIs. Deep-learning-based approaches such as sequence-to-sequence models, however, require larger sets of training samples (ideally millions of samples). To mitigate the absence of such large datasets, we formalize and define resources in REST APIs, and we propose a delexicalization technique (converting an API method and its initial utterances to tagged sequences of resources) that lets deep-learning-based approaches learn from such datasets.
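To make the delexicalization idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a simplified resource model in which a path parameter such as `{customer_id}` is tagged `Singleton_i` and any other path segment is tagged `Collection_i`. The tag names, the `delexicalize` function, the naive plural handling, and the sample utterance are all hypothetical and chosen only for illustration.

```python
import re


def delexicalize(method, path, utterance):
    """Replace resource names in an API method and its utterance with tags.

    Assumption (not from the paper): {param} segments become Singleton_i,
    every other path segment becomes Collection_i.
    """
    tags = []  # (tag, resource name) pairs in path order
    n_coll = n_single = 0
    for segment in path.strip("/").split("/"):
        if segment.startswith("{") and segment.endswith("}"):
            n_single += 1
            tags.append((f"Singleton_{n_single}", segment[1:-1]))
        else:
            n_coll += 1
            tags.append((f"Collection_{n_coll}", segment))

    # Tagged form of the API method, e.g. "GET Collection_1 Singleton_1".
    tagged_method = " ".join([method.upper()] + [tag for tag, _ in tags])

    # Replace longer names first so "customer id" is tagged before "customer".
    tagged_utterance = utterance
    for tag, name in sorted(tags, key=lambda t: len(t[1]), reverse=True):
        words = name.replace("_", " ")
        # Naive singular/plural handling: drop one trailing "s", allow it back.
        stem = words[:-1] if words.endswith("s") else words
        pattern = rf"\b{re.escape(stem)}s?\b"
        tagged_utterance = re.sub(
            pattern, tag, tagged_utterance, flags=re.IGNORECASE
        )

    # The mapping allows the original names to be restored after generation.
    return tagged_method, tagged_utterance, dict(tags)


if __name__ == "__main__":
    tagged_method, tagged_utterance, mapping = delexicalize(
        "get",
        "/customers/{customer_id}/accounts",
        "get the accounts of the customer with customer id being <customer_id>",
    )
    print(tagged_method)
    print(tagged_utterance)
    print(mapping)
```

Under these assumptions, a model trained on the tagged pairs only has to learn how tag sequences map to sentence structure. The returned mapping can then be used to substitute the concrete resource names back into a generated utterance.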
