NewsEdits: A News Article Revision Dataset and a Novel Document-Level Reasoning Challenge

News article revision histories provide clues to narrative and factual evolution in news articles. To facilitate analysis of this evolution, we present the first publicly available dataset of news revision histories, NewsEdits. Our dataset is large-scale and multilingual; it contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources based in three countries, spanning 15 years of coverage (2006-2021). We define article-level edit actions (Addition, Deletion, Edit and Refactor) and develop a high-accuracy extraction algorithm to identify them. To underscore the factual nature of many edit actions, we conduct analyses showing that added and deleted sentences are more likely than unchanged sentences to contain updating events, main content and quotes. Finally, to explore whether edit actions are predictable, we introduce three novel tasks aimed at predicting the actions performed during version updates. We show that these tasks are possible for expert humans but challenging for large NLP models. We hope this can spur research in narrative framing and help provide predictive tools for journalists chasing breaking news.
