The Wikipedia Corpus

Wikipedia, the popular online encyclopedia, has in just six years grown from an adjunct to the now-defunct Nupedia into more than 31 million pages and 429 million revisions across 256 languages, and has spawned sister projects such as Wiktionary and Wikisource. Available under the GNU Free Documentation License, it is an extraordinarily large corpus with broad scope and constant updates. Its articles are largely consistent in structure and are organized into category hierarchies. However, the wiki method of collaborative editing creates challenges that must be addressed. Wikipedia's accuracy is frequently questioned, systemic bias leaves quality and coverage uneven, and even the mixture of English dialects can trip up the unwary with differences in semantics, diction, and spelling. This paper examines Wikipedia from a research perspective, providing basic background knowledge and an understanding of its strengths and weaknesses. We also address a technical challenge posed by the sheer volume of text made available (1.04 TB for the English version), using a simple, easily implemented dictionary compression algorithm that permits time-efficient random access to the data while reducing its size twenty-eight-fold.
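The abstract does not specify the paper's compression scheme in detail, but the core idea of dictionary compression with random access can be sketched as follows: each distinct token is assigned a small integer id, and tokens are stored as fixed-width codes so that any position in the compressed stream can be decoded directly, without decompressing a prefix. This is a minimal illustration, not the authors' actual algorithm; the 4-byte code width and the toy text are assumptions for the example.

```python
import struct

def build_dictionary(tokens):
    """Assign each distinct token a small integer id (the 'dictionary')."""
    ids = {}
    for tok in tokens:
        if tok not in ids:
            ids[tok] = len(ids)
    return ids

def compress(tokens, ids):
    """Encode tokens as fixed-width 4-byte little-endian ids.

    Fixed-width codes are what make random access cheap: the token at
    position i always lives at byte offset 4*i, so no prefix of the
    stream needs to be decoded first.
    """
    return b"".join(struct.pack("<I", ids[t]) for t in tokens)

def token_at(blob, index, inverse):
    """Decode the single token at a given position in O(1)."""
    (tid,) = struct.unpack_from("<I", blob, 4 * index)
    return inverse[tid]

# Toy corpus standing in for article text (assumption for illustration).
text = "the quick brown fox jumps over the lazy dog the fox".split()
ids = build_dictionary(text)
inverse = {v: k for k, v in ids.items()}
blob = compress(text, ids)
```

Real corpora would use variable-width codes or block-level indexes to improve the compression ratio, trading a little lookup work for space; the fixed-width variant above keeps the random-access property easiest to see.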
