Wikipedia workload analysis for decentralized hosting

We study an access trace containing a sample of Wikipedia's traffic over a 107-day period, with the aim of identifying appropriate replication and distribution strategies for a fully decentralized hosting environment. We perform a global analysis of the whole trace and a detailed analysis of the requests directed to the English edition of Wikipedia. In our study, we classify client requests and examine aspects such as the number of read and save operations, significant load variations, and requests for nonexistent pages. We also review proposed decentralized wiki architectures and discuss how they would handle Wikipedia's workload. We conclude that decentralized architectures must focus on handling read operations efficiently while maintaining consistency and dealing with typical issues in decentralized systems, such as churn, unbalanced loads, and malicious participating nodes.
