GenWiki: A Dataset of 1.3 Million Content-Sharing Text and Graphs for Unsupervised Graph-to-Text Generation

Data collection for knowledge graph-to-text generation is expensive. As a result, research on unsupervised models has recently emerged as an active field. However, most unsupervised models have to use non-parallel versions of existing small supervised datasets, which largely constrains their potential. In this paper, we propose a large-scale, general-domain dataset, GenWiki. Our unsupervised dataset has 1.3M text examples and 1.3M graph examples. Together with a human-annotated test set, we provide this dataset as a new benchmark for future research on unsupervised text generation from knowledge graphs.
