Automatic Moderation of Comments in a Large On-line Journalistic Environment

On-line journalistic sites publish many news stories every day. Readers of these sites may comment on a story and, as a consequence, a single story might receive thousands of comments. The quality of these comments varies widely, from spam to truly useful information. Separating good comments from bad ones is the primary goal of comment moderation. In this paper we address the problem of automatic moderation of comments in a large journalistic Web site. Participants of the site may interact with each other, constituting a large social network. We propose a classification technique that combines implicit patterns in the comments’ content with patterns hidden in the social network, and then uses the result for automatic moderation. We evaluate the proposed technique on a real collection of comments gathered from the Slashdot forum, and compare it against traditional techniques such as decision trees and SVMs. We observe that the proposed technique is very effective for scoring comments, reaching more than 96% accuracy. Classifying comments appears to be a harder task, for which the proposed technique achieves almost 67% accuracy. Further, the proposed technique is very fast, being able to classify and score hundreds of comments per minute.
