Automated detection of offensive language behavior on social networking sites

Social networking sites are booming as never before. Alongside the many new opportunities they provide, hazards such as messages containing sexual harassment or racist attacks have to be taken into account. Since manually monitoring and analysing every message separately is infeasible, automated solutions are sought. This study applies machine learning techniques to automated offensive language detection. Offensive language can be defined as “expressing extreme subjectivity”, and this study focuses mainly on two categories: ‘sexual’ and ‘racist’. The corpus originates from the Dutch distribution of the social network Netlog and contains over seven million blog messages. Only a very small share of these messages (approximately 0.85%) can be regarded as containing abusive language. Initially, the intention is to implement two supervised learning methods: Naive Bayes and Support Vector Machine (SVM). These methods base the classification of a message on previous experience, derived from a labeled training set. To build such a training set, offensive messages must be extracted efficiently from the corpus. To achieve this, an information retrieval system extended with a query expansion technique is applied. A query containing offensive terms already retrieves offensive messages, but a more efficient approach is obtained by enhancing the query with Rocchio query expansion. This study shows that query expansion can effectively increase the number of relevant messages retrieved. The supervised classifiers are trained on the labeled set, and their performance is then tested on an independent validation set. The Naive Bayes classifier does not perform well on the validation set and is therefore disregarded in the further analysis. Our Support Vector Machine implementation achieves approximately 69% precision and 62% recall. However, these results are obtained by ignoring very short messages, since SVM has difficulty classifying messages that contain little information. To tackle the issues SVM suffers from, a more reliable but less dynamic method based on word lists is designed. This method, named a semantic classifier, obtains
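
The Rocchio query expansion step mentioned in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the bag-of-words weighting, the parameter values (alpha = 1.0, beta = 0.75, gamma = 0.15, common textbook defaults), the function names, and the placeholder tokens are all assumptions made for illustration. The modified query is computed as alpha times the original query vector, plus beta times the centroid of messages judged relevant, minus gamma times the centroid of messages judged non-relevant; the highest-weighted terms form the expanded query.

```python
from collections import Counter


def term_vector(text):
    """Turn a message into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a list of term vectors; an empty list gives an empty centroid."""
    if not vectors:
        return {}
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio_expand(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15, top_k=5):
    """Return the top_k highest-weighted terms of the Rocchio-modified query.

    q_new = alpha * q + beta * centroid(relevant) - gamma * centroid(non_relevant)
    All parameter defaults are illustrative assumptions, not values from the study.
    """
    q_vec = term_vector(query)
    rel = centroid([term_vector(doc) for doc in relevant])
    non = centroid([term_vector(doc) for doc in non_relevant])

    weights = {}
    for term in set(q_vec) | set(rel) | set(non):
        w = (alpha * q_vec.get(term, 0)
             + beta * rel.get(term, 0.0)
             - gamma * non.get(term, 0.0))
        if w > 0:  # terms with non-positive weight are dropped
            weights[term] = w

    # Sort by weight (descending); ties are broken alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Placeholder tokens stand in for actual offensive terms.
expanded = rocchio_expand(
    query="badword",
    relevant=["badword slur you", "slur badword idiot"],
    non_relevant=["you are nice", "nice photo"],
    top_k=3,
)
print(expanded)  # ['badword', 'slur', 'idiot']
```

In this toy run the seed query "badword" gains the terms "slur" and "idiot" because they co-occur with it in the messages marked relevant, while terms typical of non-relevant messages are pushed down. This is the sense in which an expanded query can retrieve more relevant messages than the seed terms alone.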
