Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus

Wikipedia provides rich, natural semi-structured texts for information retrieval. It provides semantic information for keyword extraction from varied texts. It facilitates clustering, text classification and semantic relatedness analyses. It supplies a semantically structured knowledge base for studying ontologies. Although Wikipedia is primarily an encyclopedia, its expansive content provides a knowledge base that has been continuously exploited by researchers in a wide variety of domains. This article systematically reviews the scholarly studies that have used Wikipedia as a data source, and investigates the means by which Wikipedia has been employed in three main computer science research areas: information retrieval, natural language processing, and ontology building. We report and discuss the research trends of the identified and examined studies. We further identify and classify a list of tools that can be used to extract data from Wikipedia, and compile a list of currently available data sets extracted from Wikipedia.
