Finding News Citations for Wikipedia

A core Wikipedia editing policy is to provide citations for statements added to its pages, where a statement can be any piece of text, from a single sentence to a paragraph. In many cases citations are either outdated or missing altogether. In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach. In the first stage, we construct a classifier that determines whether a statement needs a news citation or another kind of citation (web, book, journal, etc.). In the second stage, we develop a news citation algorithm for Wikipedia statements, which recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should come from an authoritative source. We perform an extensive evaluation of both stages, using 20 million articles from a real-world news collection. The results are promising and show that the task can be performed with high precision and at scale.
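The two-stage structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cue-word check stands in for the trained stage-one classifier, token overlap stands in for entailment and centrality, and the `AUTHORITY` table, the weights, and all function and source names are hypothetical choices made only for illustration.

```python
import re
from dataclasses import dataclass

TOKEN = re.compile(r"[a-z0-9]+")


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(TOKEN.findall(text.lower()))


@dataclass
class Article:
    source: str
    title: str
    body: str


# Hypothetical authority scores per news source (assumption, not from the paper).
AUTHORITY = {"bigpaper.example": 0.9, "smallblog.example": 0.3}

# Hypothetical cue words; a trained classifier would replace this in stage one.
NEWS_CUES = {"announced", "reported", "said", "elected", "resigned"}


def needs_news_citation(statement):
    """Stage 1 stand-in: does the statement look like it needs a news citation?"""
    return bool(tokens(statement) & NEWS_CUES)


def entailment_proxy(statement, article):
    """Fraction of statement tokens found anywhere in the article (crude proxy)."""
    s = tokens(statement)
    return len(s & tokens(article.title + " " + article.body)) / len(s) if s else 0.0


def centrality_proxy(statement, article):
    """Fraction of statement tokens found in the title (crude proxy)."""
    s = tokens(statement)
    return len(s & tokens(article.title)) / len(s) if s else 0.0


def authority(article):
    """Look up the source score; unknown sources get a neutral 0.5."""
    return AUTHORITY.get(article.source, 0.5)


def score(statement, article, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three property proxies (weights are arbitrary)."""
    w_ent, w_cen, w_auth = weights
    return (w_ent * entailment_proxy(statement, article)
            + w_cen * centrality_proxy(statement, article)
            + w_auth * authority(article))


def recommend(statement, collection, k=1):
    """Stage 2: rank candidate articles, but only for statements passing stage 1."""
    if not needs_news_citation(statement):
        return []
    ranked = sorted(collection, key=lambda a: score(statement, a), reverse=True)
    return ranked[:k]
```

In a real system each proxy would be replaced by a learned feature or model, and the candidate set would first be narrowed by querying the news collection with the statement rather than scoring every article.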
