Attribute selection for improving spam classification in online social networks: a rough set theory-based approach

As online social network (OSN) sites become increasingly popular, they are targeted by spammers who post malicious content. It is therefore important to filter spam accounts and spam posts out of OSNs. Several prior works on spam classification in OSNs use various features to distinguish spam from legitimate entities. The objective of this study is to improve such classification by developing an attribute selection methodology that finds a smaller subset of the attributes which leads to better classification. Specifically, we apply concepts from rough set theory to develop the attribute selection algorithm. We perform experiments on five spam classification datasets from diverse OSNs and compare the proposed methodology with several baseline attribute selection methodologies. We find that, for most of the datasets, the proposed methodology selects an attribute subset that is smaller than those selected by the baselines, yet achieves better classification performance.
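
To make the idea of rough-set-based attribute selection concrete, the following is a minimal sketch of a generic greedy reduct search driven by the rough-set dependency degree (often called QuickReduct). It is not the algorithm proposed in this study, whose details are not given here. The function names `dependency` and `quick_reduct`, the toy attributes, and the labels are all hypothetical, and the attributes are assumed to be already discretized.

```python
from collections import defaultdict

def dependency(rows, labels, attrs):
    """Rough-set dependency degree of the label on `attrs`.

    Objects with equal values on `attrs` are indiscernible. The positive
    region is the union of indiscernibility classes whose members all
    share one label; the degree is its size divided by the number of objects.
    """
    blocks = defaultdict(list)
    for row, label in zip(rows, labels):
        blocks[tuple(row[a] for a in attrs)].append(label)
    consistent = sum(len(ls) for ls in blocks.values() if len(set(ls)) == 1)
    return consistent / len(rows)

def quick_reduct(rows, labels):
    """Greedily add the attribute that most increases the dependency degree
    until it matches the degree obtained with all attributes."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    selected = []
    best = dependency(rows, labels, selected)
    while best < target:
        candidates = [a for a in all_attrs if a not in selected]
        scores = {a: dependency(rows, labels, selected + [a]) for a in candidates}
        chosen = max(candidates, key=lambda a: scores[a])
        selected.append(chosen)
        best = scores[chosen]
    return selected

# Hypothetical toy data: (has_url, many_followers, account_age_bucket).
rows = [(1, 0, 0), (1, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
labels = ["spam", "spam", "ham", "ham", "ham", "ham"]

print(quick_reduct(rows, labels))  # selects attributes 0 and 1
```

In this toy data the label depends only on the first two attributes, so the search stops after selecting them and leaves out the third. A greedy search of this kind is not guaranteed to find the smallest possible subset.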
