DART: Open-Domain Structured Data Record to Text Generation

We introduce DART, a large open-domain dataset for structured data-record-to-text generation. Each input is a set of RDF entity-relation triples, a format widely used for knowledge representation and semantic description. DART consists of 82,191 examples spanning multiple domains; each input is a semantic RDF triple set derived from the data records of a table and the tree ontology of its schema, annotated with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format, combined with its open-domain nature, differentiates DART from existing table-to-text corpora. We evaluate several state-of-the-art text generation models on DART, showing that it introduces new and interesting challenges compared to existing datasets. Furthermore, we demonstrate that finetuning pretrained language models on DART improves out-of-domain generalization on the WebNLG 2017 dataset. DART is available at this https URL.
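To make the input format concrete, the following is an illustrative sketch (not DART's exact preprocessing) of how a triple set of (subject, relation, object) entries can be linearized into the flat string form commonly fed to sequence-to-sequence generation models; the example entities and the `<H>`/`<R>`/`<T>` marker tokens are assumptions for illustration:

```python
# A tripleset represented as (subject, relation, object) tuples.
# The entities below are illustrative, not drawn from the dataset.
triples = [
    ("Mars Hill College", "JOINED", "1973"),
    ("Mars Hill College", "LOCATION", "Mars Hill, North Carolina"),
]

def linearize(tripleset):
    """Flatten RDF-style triples into one input string, marking the
    head entity, relation, and tail of each triple with special tokens."""
    return " ".join(f"<H> {s} <R> {r} <T> {o}" for s, r, o in tripleset)

print(linearize(triples))
# <H> Mars Hill College <R> JOINED <T> 1973 <H> Mars Hill College <R> LOCATION <T> Mars Hill, North Carolina
```

A model would then be trained to map this linearized string to a reference sentence covering all the facts, e.g. "Mars Hill College, located in Mars Hill, North Carolina, joined in 1973."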
