Soft computing based imputation and hybrid data and text mining: The case of predicting the severity of phishing alerts

In this paper, we employ a novel two-stage soft computing approach to data imputation in order to assess the severity of phishing attacks. The imputation method combines the K-means algorithm and a multilayer perceptron (MLP) working in tandem. This hybrid is used to replace the missing values in the financial data from which the severity of phishing attacks on financial firms is predicted. After imputing the missing values, we mine the firms' financial data together with the structured form of the textual data, using MLP, probabilistic neural network (PNN), and decision tree (DT) classifiers separately. Of particular significance are the overall classification accuracies of 81.80%, 82.58%, and 82.19% obtained with MLP, PNN, and DT respectively. These results outperform those of prior research. The classification accuracies obtained by MLP, PNN, and DT for the three risk levels of phishing attacks are also superior.
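As a rough illustration of the first (K-means) stage only, the sketch below clusters the complete records and fills each missing entry with the corresponding value of the nearest centroid. The abstract does not give the details of the method, so everything here is an assumption made for illustration: the function names `kmeans` and `impute_with_centroids`, the toy data, the use of `None` to mark missing values, and the choice to measure distance over the observed features only. The second (MLP) stage is not shown.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's K-means on complete records; returns the k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def impute_with_centroids(record, centroids):
    """Fill None entries with the values of the nearest centroid.

    Assumption: "nearest" is judged on the observed features only.
    """
    observed = [i for i, v in enumerate(record) if v is not None]

    def partial_dist(c):
        return math.sqrt(sum((record[i] - c[i]) ** 2 for i in observed))

    nearest = min(centroids, key=partial_dist)
    return [nearest[i] if v is None else v for i, v in enumerate(record)]


# Hypothetical toy data: two well-separated groups of complete records.
complete = [
    [1.0, 2.0, 3.0], [1.2, 1.8, 3.1], [0.9, 2.1, 2.9],
    [10.0, 20.0, 30.0], [10.5, 19.5, 30.5], [9.5, 20.5, 29.5],
]
centroids = kmeans(complete, k=2)

# Records with one missing value each (None marks the gap).
filled_a = impute_with_centroids([1.1, None, 3.0], centroids)
filled_b = impute_with_centroids([None, 19.0, 31.0], centroids)
print(filled_a)
print(filled_b)
```

In a two-stage scheme of the kind the abstract describes, estimates like these would then be combined with or refined by an MLP. How the two stages are coupled is not specified in the abstract, so it is left open here.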
