Quality assessment of collaborative content with minimal information

User-generated content is one of the most interesting phenomena in published media. However, the possibility of unrestricted editing raises doubts about its quality. This issue has motivated many studies on how to automatically assess content quality in collaborative web sites. Generally, these studies use machine learning techniques to combine a large number of quality indicators into a single value representing the overall quality of the document. This need for a large number of indicators, however, is detrimental to both the efficiency and the effectiveness of the quality assessment algorithms. In this work, we exploit and extend a feature selection method based on the SPEA2 multi-objective genetic algorithm. Results show that we can reduce the feature set to between 15% and 25% of its original size while obtaining error rates comparable to the state of the art.
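To make the multi-objective view of feature selection concrete, the sketch below computes the SPEA2 strength and raw fitness for a handful of candidate feature subsets, each scored on two objectives to be minimized: prediction error and number of selected features. This is an illustrative sketch only, not the implementation used in this work: the candidate scores are made-up values, the function names are hypothetical, and the density term that full SPEA2 adds to the raw fitness is omitted.

```python
from typing import List, Tuple

# (error, number of selected features); both objectives are minimized.
Objectives = Tuple[float, int]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def spea2_raw_fitness(points: List[Objectives]) -> List[int]:
    """SPEA2 raw fitness: sum of the strengths of all individuals dominating i.

    The strength of an individual is the number of individuals it dominates.
    A raw fitness of 0 means the individual is non-dominated.
    """
    n = len(points)
    strength = [
        sum(dominates(points[i], points[j]) for j in range(n)) for i in range(n)
    ]
    return [
        sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
        for i in range(n)
    ]


# Hypothetical candidates: (error, feature count) for five feature subsets.
candidates: List[Objectives] = [(0.30, 5), (0.25, 12), (0.25, 20), (0.40, 3), (0.35, 6)]
raw = spea2_raw_fitness(candidates)

# Non-dominated subsets form the current Pareto front.
front = [c for c, r in zip(candidates, raw) if r == 0]
print(raw)
print(front)
```

In this toy population, the subsets left on the front trade error against size: none of them can be improved on one objective without getting worse on the other, which is the kind of trade-off a reduced feature set has to satisfy.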
