Paper Information - Using Crowdsourcing to Improve Profanity Detection

Profanity detection is often thought to be an easy task. However, past work has shown that current list-based systems perform poorly. First, they fail to adapt to evolving profane slang and to identify profane terms that have been disguised or only partially censored (e.g., @ss, f$#%) or intentionally or unintentionally misspelled (e.g., biatch, shiiiit). For these reasons, they are easy to circumvent and have very poor recall. Second, they are a one-size-fits-all solution, assuming that the definition, use, and perception of profane or inappropriate language hold across all contexts. In this article, we present work that attempts to move beyond list-based profanity detection systems by identifying the context in which profanity occurs. The proposed system uses a set of comments from a social news site labeled by Amazon Mechanical Turk workers for the presence of profanity. This system far surpasses the performance of list-based profanity detection techniques. The use of crowdsourcing in this task suggests an opportunity to build profanity detection systems tailored to sites and communities.
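To make the recall problem concrete, the following is a minimal sketch of a list-based detector of the kind the abstract criticizes. It is not the system evaluated in the paper: the word list, the tokenizer, and the function name `list_based_flag` are hypothetical and chosen only for illustration. The test comments reuse the disguised and misspelled forms quoted above.

```python
import re

# Hypothetical word list for illustration; real lists are far longer.
PROFANITY_LIST = {"ass", "shit", "bitch"}


def list_based_flag(comment: str) -> bool:
    """Return True if any alphabetic token exactly matches a list entry."""
    # Lowercase the comment and keep only runs of the letters a-z as tokens.
    tokens = re.findall(r"[a-z]+", comment.lower())
    return any(token in PROFANITY_LIST for token in tokens)


if __name__ == "__main__":
    examples = [
        "that was shit",      # exact list entry
        "what an @ss",        # partially censored form
        "shiiiit that hurt",  # elongated misspelling
        "such a biatch",      # deliberate misspelling
    ]
    for text in examples:
        print(f"{text!r} -> {list_based_flag(text)}")
```

Under these assumptions, only the first comment is flagged. The other three contain variants that never match a list entry exactly, which is the circumvention and low-recall behavior described above.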
