FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia

With over 23 million articles in 285 languages, Wikipedia is the largest free knowledge base on the web. Due to its open nature, anyone is allowed to access and edit the contents of this huge encyclopedia. As a downside of this open-access policy, quality assessment of the content becomes a critical issue, one that is hardly manageable without computational assistance. In this paper, we present FlawFinder, a modular system for automatically predicting quality flaws in unseen Wikipedia articles. It competed in the inaugural edition of the Quality Flaw Prediction Task at the PAN Challenge 2012, achieving the best precision of all competing systems and placing second in terms of recall and F1-score.
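
To make the prediction setup concrete, the sketch below shows one way such per-flaw prediction can be framed as supervised classification: one binary model per flaw type, trained on articles tagged with the corresponding cleanup template versus untagged articles. The feature representation (TF-IDF bag of words), the model (logistic regression), and the scikit-learn stack are illustrative assumptions for this sketch, not FlawFinder's actual feature set or architecture.

```python
# Minimal sketch of per-flaw quality-flaw prediction. This is NOT the
# authors' pipeline: the features (TF-IDF bag of words) and the model
# (logistic regression) are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_flaw_classifier(flawed_articles, clean_articles):
    """Train a binary classifier for one flaw type (e.g. 'Unreferenced').

    flawed_articles: article texts tagged with the flaw's cleanup template
    clean_articles:  article texts without that tag
    """
    texts = list(flawed_articles) + list(clean_articles)
    labels = [1] * len(flawed_articles) + [0] * len(clean_articles)
    model = make_pipeline(
        TfidfVectorizer(max_features=10000),   # shallow lexical features
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model


# One classifier per flaw; an unseen article is then scored by each model:
# classifiers = {flaw: train_flaw_classifier(pos[flaw], neg[flaw])
#                for flaw in flaw_types}
# predictions = {flaw: clf.predict([article_text])[0]
#                for flaw, clf in classifiers.items()}
```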
