Fine Grained Citation Span for References in Wikipedia

\emph{Verifiability} is one of the core editing principles of Wikipedia, and editors are encouraged to provide citations for the content they add. For a Wikipedia article, determining the \emph{citation span} of a citation, i.e., which content the citation covers, is important because it helps identify the content for which citations are still missing. We are the first to address the problem of determining the \emph{citation span} in Wikipedia articles. We approach this problem by classifying which textual fragments in an article are covered by a citation. We propose a sequence classification approach that, given a paragraph and a citation, determines the citation span at a fine-grained level. We provide a thorough experimental evaluation and compare our approach against baselines adopted from the scientific domain, showing improvements on all evaluation metrics.
