Artificial immune system for illicit content identification in social media

Social media is frequently used as a platform for exchanging information and opinions as well as for disseminating propaganda. However, online content can also be misused to distribute illicit information, such as violent postings in web forums. Illicit content is highly distributed in social media, while non-illicit content is unspecific and topically diverse. It is costly and time-consuming to label a large amount of illicit content (positive examples) and non-illicit content (negative examples) to train classification systems. Nevertheless, it is relatively easy to obtain large volumes of unlabeled content from social media. In this article, an artificial immune system-based technique is presented to address the difficulties of illicit content identification in social media. Inspired by the positive selection principle in the immune system, we designed a novel labeling heuristic based on partially supervised learning to extract high-quality positive and negative examples from unlabeled datasets. Empirical evaluation on two large hate group web forums suggests that the proposed approach generally outperforms the benchmark techniques and exhibits more stable performance. © 2012 Wiley Periodicals, Inc.
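
The general idea of extracting likely positive and negative examples from unlabeled data can be illustrated with a minimal sketch. This is not the labeling heuristic proposed in the article, whose details the abstract does not give; it is a generic similarity-threshold scheme shown only for orientation. The function names, the bag-of-words representation, the seed set of known positive documents, and the thresholds `hi` and `lo` are all assumptions made for this illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_unlabeled(seed_positives, unlabeled, hi=0.5, lo=0.1):
    """Split unlabeled documents by similarity to a seed positive profile.

    Hypothetical heuristic: documents scoring at or above `hi` are treated as
    likely positives, those at or below `lo` as likely negatives, and the
    rest are left undecided. The thresholds are illustrative assumptions.
    """
    # Sum the term counts of the seed positives into one profile vector.
    profile = Counter()
    for doc in seed_positives:
        profile.update(vectorize(doc))

    positives, negatives, undecided = [], [], []
    for doc in unlabeled:
        score = cosine(vectorize(doc), profile)
        if score >= hi:
            positives.append(doc)
        elif score <= lo:
            negatives.append(doc)
        else:
            undecided.append(doc)
    return positives, negatives, undecided


seeds = ["hostile threat against group", "threat against them"]
pool = ["threat against group", "recipe for apple pie", "they like them"]
likely_pos, likely_neg, unsure = label_unlabeled(seeds, pool)
```

In a partially supervised setting, the likely positives and negatives extracted this way would then serve as training data for a conventional classifier, while the undecided documents stay unlabeled.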
