Paper information - Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus

Wikipedia provides rich, natural, semi-structured text for information retrieval. It provides semantic information for keyword extraction from varied texts. It facilitates clustering, text classification, and semantic relatedness analyses. It supplies a semantically structured knowledge base for studying ontologies. Although primarily an encyclopedia, Wikipedia's expansive content provides a knowledge base that researchers in a wide variety of domains have continuously exploited. This article systematically reviews the scholarly studies that have used Wikipedia as a data source, and investigates how Wikipedia has been employed in three main computer science research areas: information retrieval, natural language processing, and ontology building. We report and discuss the research trends in the studies we identified and examined. We further identify and classify tools that can be used to extract data from Wikipedia, and compile a list of currently available data sets extracted from Wikipedia.
