Malicious Text Identification: Deep Learning from Public Comments and Emails

Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, because these messages resemble genuine communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying preprocessed text. In particular, Long Short-Term Memory (LSTM) networks are among the models that perform well on binary and multi-label text classification problems. In this paper, we present an approach that merges two different data sources, one intended for Spam classification in social media posts and the other for Fraud classification in emails. We designed a multi-label LSTM model and trained it on the joint dataset, which includes text containing common bigrams extracted from each independent dataset. The experimental results show that the proposed model is capable of identifying malicious text regardless of the source. The LSTM model trained on the merged dataset outperforms the models trained independently on each dataset.
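
The merging step described above can be illustrated with a minimal sketch. The abstract does not specify the tokenization, the selection rule, or the label format, so everything here is an assumption: whitespace tokenization, keeping a text when it contains at least one bigram found in both sources, and the hypothetical helper names `bigrams`, `bigram_set`, and `merge_on_common_bigrams`. The sample rows are invented for illustration only.

```python
def bigrams(text):
    """Return adjacent token pairs (assumed: lowercase, whitespace split)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_set(texts):
    """Collect every distinct bigram found in an iterable of texts."""
    found = set()
    for text in texts:
        found.update(bigrams(text))
    return found


def merge_on_common_bigrams(spam_rows, fraud_rows):
    """Merge two labeled datasets, keeping texts that share a common bigram.

    Each input is a list of (text, label) pairs. The selection rule is an
    assumption, not the paper's confirmed procedure.
    """
    common = bigram_set(t for t, _ in spam_rows) & bigram_set(
        t for t, _ in fraud_rows
    )
    merged = []
    for source, rows in (("spam", spam_rows), ("fraud", fraud_rows)):
        for text, label in rows:
            # Keep the row only if it contains a bigram seen in both sources.
            if common.intersection(bigrams(text)):
                merged.append((text, source, label))
    return common, merged


# Invented toy rows: (text, 1 = malicious, 0 = benign).
spam_rows = [
    ("click here to win a free phone", 1),
    ("great video thanks for sharing", 0),
]
fraud_rows = [
    ("please click here to claim your funds", 1),
    ("meeting moved to friday morning", 0),
]

common, merged = merge_on_common_bigrams(spam_rows, fraud_rows)
for row in merged:
    print(row)
```

In this toy input the two sources share the bigrams "click here" and "here to", so only the two texts containing them enter the joint dataset. The merged rows would then be tokenized and fed to a classifier such as the LSTM the abstract mentions; that training step is outside this sketch.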
