Article quality classification on Wikipedia: introducing document embeddings and content features

The quality of articles on the Wikipedia platform is vital for its success. Currently, quality is assessed manually by the Wikipedia community: editors classify articles into pre-defined quality classes. However, this approach scales poorly, and hence approaches for automatic classification have been investigated. In this paper, we continue this line of research on article quality classification by extending the feature set with novel content and edit features (e.g., document embeddings of articles). We propose a classification approach utilizing gradient boosted trees based on this extended set of features extracted from Wikipedia articles. Based on an established dataset containing Wikipedia articles and quality classes, we show that our approach is able to substantially outperform previous approaches (including recent deep learning methods). Furthermore, we shed light on the contribution of individual features and show that the proposed features indeed capture the quality of an article well.
