The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

With the growing volume of user-generated reference texts on the web, automatic quality assessment has become a key challenge. However, only a small amount of annotated data is available for training quality assessment systems. Wikipedia contains a large number of texts annotated with cleanup templates, which identify quality flaws. We show that the distribution of these labels is topically biased, since a given template cannot be applied freely to any arbitrary article. We argue that the topical restrictions of each label must be taken into account in order to avoid a sampling bias that results in a skewed classifier and overly optimistic evaluation results. We factor out the topic bias by extracting reliable training instances from the revision history whose topic distribution is similar to that of the labeled articles. This approach better reflects the situation a classifier would face in a real-life application.
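
The idea of drawing topic-matched training instances from the revision history can be illustrated with a minimal sketch. It rests on an assumption that the abstract does not spell out: that a flawed instance and its counterpart are taken from the same article, namely the revision just before a cleanup template is removed and the revision just after. The function names `has_template` and `matched_pair`, the plain list-of-strings history, and the template name `Advert` in the usage note are hypothetical and serve illustration only.

```python
import re
from typing import List, Optional, Tuple


def has_template(wikitext: str, template: str) -> bool:
    """Return True if the wikitext contains the cleanup template, e.g. {{Advert}}."""
    # Match "{{Name}}" or "{{Name|...}}", ignoring case and inner whitespace.
    pattern = r"\{\{\s*" + re.escape(template) + r"\s*(\||\}\})"
    return re.search(pattern, wikitext, flags=re.IGNORECASE) is not None


def matched_pair(revisions: List[str], template: str) -> Optional[Tuple[str, str]]:
    """Find a (flawed, repaired) pair in one article's history (oldest first).

    Assumption for this sketch: the revision right before the template
    disappears is treated as flawed, and the revision right after as repaired.
    Both texts come from the same article, so they share its topic.
    """
    for prev, curr in zip(revisions, revisions[1:]):
        if has_template(prev, template) and not has_template(curr, template):
            return prev, curr
    return None  # the template was never removed in this history
```

For a history such as `["Intro.", "{{Advert}} Buy now!", "Neutral text."]`, calling `matched_pair(history, "Advert")` yields the second and third revisions as a pair. Because both instances share an article, a classifier trained on such pairs cannot separate the classes by topic alone.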
