News Provenance: Revealing News Text Reuse at Web-Scale in an Augmented News Search Experience

The media industry routinely reuses news content, a practice that may surprise news consumers. Whether reuse happens by agreement or by plagiarism, the lack of explicit citations makes it difficult to understand where news comes from and how it spreads. We reveal news provenance by reconstructing the history of near-duplicate news in the web index, identifying both the origins of republished content and the impact of original content. Aggregating this provenance information and presenting it as part of news search results may help users make more informed decisions about which articles to read and which publishers to trust. We report on early analysis and user feedback, highlighting the critical tension between the desire for media transparency and the risks of disrupting an already fragile ecosystem.
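To make the idea of reconstructing provenance from near-duplicates concrete, the following is a minimal sketch, not the method used in this work. It assumes that near-duplicates can be approximated by Jaccard similarity over word trigrams and that the earliest-published matching article is treated as the origin. The `Article` type, the `trace_provenance` function, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:  # hypothetical record type, not from the paper
    publisher: str
    published: datetime
    text: str


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def trace_provenance(articles, threshold=0.5):
    """Map each article index to the index of its assumed origin.

    Articles are visited in publication order. Each one is linked to the
    origin of the earliest earlier article whose shingle overlap meets the
    (assumed) threshold; otherwise it is treated as original content.
    """
    order = sorted(range(len(articles)), key=lambda i: articles[i].published)
    sigs = [shingles(a.text) for a in articles]
    origin = {}
    for pos, i in enumerate(order):
        origin[i] = i  # original unless an earlier near-duplicate is found
        for j in order[:pos]:
            if jaccard(sigs[i], sigs[j]) >= threshold:
                origin[i] = origin[j]
                break
    return origin
```

Under these assumptions, counting how often each index appears as a value in the returned mapping gives a rough measure of the impact of an original article. The pairwise comparison is quadratic in the number of articles, so a web-scale system would need an indexing or hashing scheme that this sketch does not attempt.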
