WHAD: Wikipedia historical attributes data - Historical structured data extraction and vandalism detection from the Wikipedia edit history

This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia revision history. By mining (attribute, value) pairs from the revision history of the English Wikipedia, we collect a comprehensive knowledge base that records how attribute values change over time. When working with the Wikipedia edit history, vandalism and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that, on this dataset, it achieves an accuracy comparable to that of a state-of-the-art vandalism identification method based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
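To make the idea of temporally anchored (attribute, value) pairs concrete, the following is a minimal sketch and not the pipeline used in the paper. It assumes each revision is available as a (timestamp, wikitext) pair with ISO 8601 timestamps, uses a deliberately simplified infobox parser (first `{{Infobox ...}}` template only, no handling of comments or triple-brace parameters), and all function names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Revision = Tuple[str, str]  # (ISO 8601 timestamp, wikitext); assumed input shape


def split_top_level(body: str) -> List[str]:
    """Split on '|' characters that are not nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two in ("{{", "[["):
            depth += 1
            current.append(two)
            i += 2
        elif two in ("}}", "]]"):
            depth -= 1
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def extract_infobox(wikitext: str) -> Dict[str, str]:
    """Return attribute -> value for the first infobox (simplified parser)."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}
    depth, i = 0, start
    while i < len(wikitext):
        two = wikitext[i:i + 2]
        if two == "{{":
            depth += 1
            i += 2
        elif two == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                break
        else:
            i += 1
    if depth != 0:
        return {}  # unbalanced braces: treat the infobox as unparseable
    body = wikitext[start + 2:i - 2]
    pairs = {}
    for field in split_top_level(body)[1:]:  # skip the template name
        if "=" in field:
            key, value = field.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def attribute_timeline(revisions: List[Revision]) -> Dict[str, List[Tuple[str, str]]]:
    """Record, per attribute, each (timestamp, value) at which the value changed."""
    timeline: Dict[str, List[Tuple[str, str]]] = {}
    previous: Dict[str, str] = {}
    for timestamp, wikitext in sorted(revisions):
        current = extract_infobox(wikitext)
        for attr, value in current.items():
            if previous.get(attr) != value:
                timeline.setdefault(attr, []).append((timestamp, value))
        previous = current
    return timeline


# Toy data, invented for illustration only.
revisions = [
    ("2010-01-01T00:00:00Z", "{{Infobox settlement\n| name = Example City\n| population = 1000\n}}"),
    ("2011-01-01T00:00:00Z", "{{Infobox settlement\n| name = Example City\n| population = 1200\n}}"),
]
timeline = attribute_timeline(revisions)
```

In this toy run, `timeline["population"]` holds two timestamped values while `timeline["name"]` holds one, since the name did not change between revisions. A real extraction would also need to filter vandalized or erroneous revisions before trusting such a timeline, which is the data-quality concern the abstract raises.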
