SportSett:Basketball - A robust and maintainable data-set for Natural Language Generation

Data2Text Natural Language Generation is a complex and varied task. We investigate the data requirements for the difficult real-world problem of generating statistic-focused summaries of basketball games. This has recently been tackled using the Rotowire and Rotowire-FG datasets of paired data and text. It can, however, be difficult to filter, query, and maintain such large volumes of data. In this resource paper, we introduce the SportSett:Basketball database. This easy-to-use resource allows simple scripts to be written that generate data in suitable formats for a variety of systems. Building upon the existing data, we provide additional attributes across multiple dimensions, increasing the overlap of content between data and text. We also highlight and resolve issues of training, validation, and test partition contamination in these previous datasets.
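
The following is a minimal sketch of the kind of "simple script" the abstract refers to: pulling paired (statistics, summary) examples for one data partition out of a relational copy of the resource. The table and column names used here (games, box_scores, summaries, split) are illustrative assumptions for the sketch, not the actual SportSett:Basketball schema.

```python
import sqlite3

def load_examples(db_path: str, split: str = "train"):
    """Yield (box_score_rows, summary_text) pairs for one data partition.

    Schema names below are hypothetical; adapt them to the real database.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    cur = conn.cursor()

    # Select the games that belong to the requested partition.
    cur.execute("SELECT id FROM games WHERE split = ?", (split,))
    game_ids = [row["id"] for row in cur.fetchall()]

    for game_id in game_ids:
        # Per-player statistics for this game (hypothetical columns).
        cur.execute(
            "SELECT player_name, points, rebounds, assists "
            "FROM box_scores WHERE game_id = ?",
            (game_id,),
        )
        stats = [dict(row) for row in cur.fetchall()]

        # The human-written game summary paired with those statistics.
        cur.execute("SELECT text FROM summaries WHERE game_id = ?", (game_id,))
        summary = cur.fetchone()["text"]

        yield stats, summary

    conn.close()
```

Because the examples live in a single relational store, retargeting the output format (for example, flattening records into the input expected by a sequence-to-sequence model) only requires changing the query, not rebuilding the dataset.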
