Revision Graph Extraction in Wikipedia Based on Supergram Decomposition and Sliding Update

SUMMARY As one of the popular social media platforms that many people have turned to in recent years, the collaborative encyclopedia Wikipedia provides information from a more “Neutral Point of View” than other sources. In pursuit of this core principle, a great deal of effort has gone into collaborative contribution and editing. The trajectories along which such collaboration unfolds through revisions are valuable for research on group dynamics and social media, which calls for precisely extracting the underlying derivation relationships among revisions from the chronologically sorted revision history. In this paper, we propose a revision graph extraction method based on supergram decomposition over a collection of near-duplicate documents. The plain text of each revision is characterized by its frequency distribution of supergrams, the variable-length token sequences that remain unchanged across revisions. We show that this method performs the task more effectively than existing methods.
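Since the abstract only sketches the approach, a minimal illustrative implementation may help fix the idea. The sketch below is not the paper's algorithm: it assumes supergrams can be approximated by the maximal token runs two revisions share verbatim (found with Python's difflib rather than the paper's decomposition and sliding-update machinery), it scores revisions by shared-run coverage instead of the paper's distribution-based measure, and every function name in it is hypothetical.

```python
# Minimal sketch of supergram-style revision graph extraction.
# Assumptions (not from the paper): supergrams are approximated by the
# maximal token runs two revisions share verbatim, computed via difflib;
# all function names here are hypothetical.
from difflib import SequenceMatcher

def shared_supergram_coverage(tokens_a, tokens_b):
    """Count the tokens covered by the maximal token runs that two
    revisions share verbatim (our stand-in for shared supergrams)."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    # get_matching_blocks() ends with a dummy zero-size block, which
    # contributes nothing to the sum.
    return sum(block.size for block in matcher.get_matching_blocks())

def similarity(tokens_a, tokens_b):
    """Dice-style score: fraction of all tokens covered by shared runs
    (a simple placeholder for the paper's frequency-distribution measure)."""
    total = len(tokens_a) + len(tokens_b)
    if not total:
        return 0.0
    return 2.0 * shared_supergram_coverage(tokens_a, tokens_b) / total

def extract_revision_graph(revision_texts):
    """Attach each revision to its most similar predecessor, yielding
    (child, parent) derivation edges over the chronological history."""
    tokens = [text.split() for text in revision_texts]
    edges = []
    for child in range(1, len(tokens)):
        parent = max(range(child),
                     key=lambda j: similarity(tokens[child], tokens[j]))
        edges.append((child, parent))
    return edges

if __name__ == "__main__":
    history = [
        "the cat sat on the mat",
        "the cat sat on the red mat",    # small edit of revision 0
        "the cat sat on the mat again",  # more plausibly derived from revision 0
    ]
    print(extract_revision_graph(history))  # -> [(1, 0), (2, 0)]
```

On this toy history, revision 2 is linked back to revision 0 rather than to its chronological predecessor, which is exactly the kind of non-linear derivation a revision graph captures and a flat chronological history hides.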
