Automatic Document Sketching: Generating Drafts from Analogous Texts

The advent of large pre-trained language models has made it possible to generate high-quality predictions about how to add or change a sentence in a document. However, the high branching factor inherent to text generation impedes the ability of even the strongest language models to offer useful editing suggestions at a more global, document level. We introduce a new task, DOCUMENT SKETCHING, which involves generating entire draft documents for the writer to review and revise. These drafts are built from sets of documents that overlap in form, sharing large segments of potentially reusable text, while diverging in content. To support this task, we introduce a Wikipedia-based dataset of analogous documents and investigate the application of weakly supervised methods, including the use of a transformer-based mixture of experts, together with reinforcement learning. We report experiments using automated and human evaluation methods and discuss the relative merits of these models.
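To make the setup concrete, below is a minimal sketch of the document-sketching idea: condition a pretrained sequence-to-sequence model on a set of analogous documents and decode a draft for the writer to revise. This is an illustration only; it uses an off-the-shelf BART checkpoint rather than the paper's weakly supervised mixture-of-experts model, and the example documents and separator convention are assumptions, not the authors' exact input format.

```python
# Illustrative sketch: generate a draft document conditioned on analogous
# documents using a pretrained BART model. This stands in for the paper's
# model; the input encoding below is one plausible choice, not the
# authors' actual dataset format.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Analogous documents: similar in form, divergent in content (hypothetical).
analogous_docs = [
    "Alice Smith (born 1970) is an American chemist known for her work on ...",
    "Jun Tanaka (born 1965) is a Japanese chemist known for his work on ...",
]

# Concatenate the analogues into a single conditioning sequence.
source = " </s> ".join(analogous_docs)
inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)

# Decode a draft sketch for the writer to review and revise.
draft_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=256,
    early_stopping=True,
)
print(tokenizer.decode(draft_ids[0], skip_special_tokens=True))
```

In the paper's setting, the generator would additionally be trained with weak supervision over segments shared across the analogues, and fine-tuned with reinforcement learning; the sketch above only shows the conditioning-and-decoding step.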
