Predicting quality flaws in user-generated content: the case of Wikipedia

The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content focuses on classifying content as either high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, thereby indicating the specific respects in which low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to mark content that has shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since acquiring significant test data is intricate in the Wikipedia setting, we analyze the effects of a biased sample selection. In this regard, we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided that the test data contains little noise, four flaws can be detected with a precision close to 1.
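To make the dependence of classifier effectiveness on the flaw distribution concrete, the following minimal sketch computes the precision that a classifier with fixed true-positive and false-positive rates would be expected to reach at different flaw ratios. It relies only on the standard Bayes-rule identity for precision and is not the paper's evaluation procedure; the function name `precision_at_flaw_ratio` and the rates 0.9 and 0.1 are illustrative assumptions.

```python
def precision_at_flaw_ratio(tpr: float, fpr: float, flaw_ratio: float) -> float:
    """Expected precision when a fraction `flaw_ratio` of all articles is flawed.

    tpr: fraction of flawed articles that the classifier flags (recall).
    fpr: fraction of unflawed articles that the classifier wrongly flags.
    """
    true_positives = tpr * flaw_ratio            # flagged and actually flawed
    false_positives = fpr * (1.0 - flaw_ratio)   # flagged but not flawed
    flagged = true_positives + false_positives
    if flagged == 0.0:
        return 0.0  # nothing flagged: precision is undefined, report 0 by convention
    return true_positives / flagged


if __name__ == "__main__":
    # Hypothetical classifier rates, chosen only for illustration.
    tpr, fpr = 0.9, 0.1
    # Ratios of flawed to unflawed articles from balanced (1:1) to rare (1:1000).
    for unflawed_per_flawed in (1, 10, 100, 1000):
        ratio = 1.0 / (1.0 + unflawed_per_flawed)
        p = precision_at_flaw_ratio(tpr, fpr, ratio)
        print(f"1:{unflawed_per_flawed:<5} precision = {p:.3f}")
```

With the rates held fixed, precision drops as flawed articles become rarer, which is why a result measured on a balanced sample cannot be transferred directly to a real-world setting whose flaw ratio is unknown.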
