Neural Paraphrasing by Automatically Crawled and Aligned Sentence Pairs

Paraphrasing is the task of rewriting an input text using different words, without altering the meaning of the original content. Conversational systems can exploit automatic paraphrasing to make conversations more natural, e.g., talking about a certain topic using different paraphrases at different times. Recently, the task of automatically generating paraphrases has been approached in the context of Natural Language Generation (NLG). While many existing systems simply consist of rule-based models, the recent success of deep neural networks in several NLG tasks naturally suggests exploiting such networks to generate paraphrases. However, the main obstacle to neural-network-based paraphrasing is the lack of large datasets of aligned sentence–paraphrase pairs, which are needed to train the neural models efficiently. In this paper we present a method for the automatic generation of large aligned corpora, based on the assumption that news and blog websites report the same events using different narrative styles. We propose a similarity search procedure with linguistic constraints that, given a reference sentence, locates the most similar candidate paraphrases among millions of indexed sentences. The data generation process is evaluated on the Italian language, performing experiments with pointer-based deep neural architectures.
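To illustrate the general idea of constrained similarity search over indexed sentences, the following is a minimal, self-contained sketch. It uses plain bag-of-words cosine similarity and two simple filters (a length-ratio constraint and a no-verbatim-copy constraint); the function names, thresholds, and specific constraints here are illustrative assumptions, not the paper's actual indexing and filtering pipeline.

```python
# Hedged sketch of candidate-paraphrase retrieval: rank indexed sentences
# by cosine similarity to a reference sentence, after filtering with
# simple linguistic constraints. Thresholds are illustrative defaults.
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def find_candidates(reference, corpus, min_sim=0.3, max_len_ratio=1.5):
    """Return corpus sentences most similar to `reference`, best first."""
    ref_tokens = reference.lower().split()
    ref_vec = Counter(ref_tokens)
    results = []
    for sent in corpus:
        toks = sent.lower().split()
        # Constraint 1: comparable sentence length (skip very uneven pairs).
        ratio = max(len(toks), len(ref_tokens)) / max(1, min(len(toks), len(ref_tokens)))
        if ratio > max_len_ratio:
            continue
        # Constraint 2: discard near-verbatim copies of the reference.
        if toks == ref_tokens:
            continue
        sim = cosine(ref_vec, Counter(toks))
        if sim >= min_sim:
            results.append((sim, sent))
    # Highest-similarity candidates first.
    return [s for _, s in sorted(results, reverse=True)]
```

In a realistic setting the corpus side would be an inverted index over millions of crawled sentences rather than a Python list, but the retrieve-filter-rank structure is the same.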
