Identifying malicious social media contents using multi-view Context-Aware active learning

Abstract This paper presents a semi-supervised, multi-view active learning method that uses an optimized set of the most informative samples, together with domain-specific context information, to identify malicious forum content on web-based social media platforms efficiently and effectively. Prior research shows that automated identification of malicious forum posts, which also helps in detecting the key suspects associated with them, faces several challenges: (1) online data, particularly social media data, originate from diverse and heterogeneous sources and are largely unstructured; (2) the characteristics of online data evolve quickly; and (3) only limited ground truth data are available to support the development of effective classifiers in a strictly supervised setting. To address these challenges, the proposed human–machine collaborative, semi-supervised learning method is designed to identify harmful, provocative, or fabricated forum content by observing only a small number of annotated samples. The learning framework first builds initial view-dependent classifiers from a limited labeled data collection and then lets each classifier evolve interactively into a more sophisticated model by observing data patterns from a shared shortlist of the most informative samples. This shortlist is identified via a graph-based optimization problem that is solved with a maximum flow algorithm. By designing a context-rich metric in a data-driven manner, the framework learns a sufficiently robust classification model from only a small number of human-annotated samples, typically 1–2 orders of magnitude fewer than a fully supervised solution requires. We validate our method using a large collection of flagged words with a wide range of origins; these words appear frequently in web-based forums and were manually verified by multiple experienced, independent domain experts.
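The interactive loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifiers, the margin-based uncertainty score, the synthetic two-view data, and all function names are assumptions made for illustration. In particular, a plain top-k ranking stands in for the paper's graph-based optimization solved by maximum flow, and the context-rich metric is not modeled.

```python
import math
import random

def make_data(n=200, seed=0):
    """Hypothetical synthetic data: two views of the same n samples."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    view_a = [[rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0)] for y in labels]
    view_b = [[rng.gauss(0.0, 1.0), rng.gauss(-2.0 * y, 1.0)] for y in labels]
    return [view_a, view_b], labels

def centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def margin(model, x):
    """Distance to the class-0 centroid minus distance to the class-1 centroid."""
    return math.dist(x, model[0]) - math.dist(x, model[1])

def shortlist(models, views, pool, k):
    """Shared shortlist: the k pool samples with the smallest summed |margin|.

    Assumption: simple ranking replaces the paper's max-flow based selection.
    """
    def score(i):
        return sum(abs(margin(m, v[i])) for m, v in zip(models, views))
    return sorted(pool, key=score)[:k]

def fit_views(views, labeled):
    """Train one view-dependent classifier per view on the labeled samples."""
    idx = list(labeled)
    y = [labeled[i] for i in idx]
    return [centroid_fit([v[i] for i in idx], y) for v in views]

def active_learn(views, oracle, seed_idx, rounds=5, k=4):
    """Query the oracle (human annotator) for k shortlisted samples per round."""
    labeled = {i: oracle(i) for i in seed_idx}  # seed must cover both classes
    pool = [i for i in range(len(views[0])) if i not in labeled]
    for _ in range(rounds):
        models = fit_views(views, labeled)
        for i in shortlist(models, views, pool, k):
            labeled[i] = oracle(i)
        pool = [i for i in pool if i not in labeled]
    return fit_views(views, labeled), labeled

def predict(models, views, i):
    """Combine views by summing their margins; positive means class 1."""
    return 1 if sum(margin(m, v[i]) for m, v in zip(models, views)) > 0 else 0

# Usage: 2 seed labels plus 5 rounds of 4 queries, out of 200 samples.
views, truth = make_data()
models, labeled = active_learn(views, lambda i: truth[i], seed_idx=[0, 1])
accuracy = sum(predict(models, views, i) == truth[i] for i in range(len(truth))) / len(truth)
print(f"labeled {len(labeled)} of {len(truth)} samples, accuracy {accuracy:.2f}")
```

Every view scores the same unlabeled pool and the annotator is asked only about the samples that the views jointly find hardest, so the labeling budget is shared across views instead of being spent once per view.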
