One-Class Text Document Classification with OCSVM and LSI

In this paper, we propose a novel one-class approach to text document classification that uses a One-Class Support Vector Machine (OCSVM) and Latent Semantic Indexing (LSI) in tandem. We first apply t-statistic-based feature selection to the text corpus. We then train the OCSVM on the rows of the document-term matrix that correspond to the negative class and extract the Support Vectors (SVs). In the test phase, we apply LSI to the query documents from the positive class, compare them with the SVs extracted from the negative class, and compute a match score using the cosine similarity measure. Based on a prespecified threshold on the match score, we then identify the documents that belong to the positive category. Comparing query documents against the SVs alone, rather than against every training document, reduces the computational load, which is the main contribution of the paper. We demonstrate the effectiveness of our approach on datasets pertaining to phishing and to sentiment analysis in a bank.
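
The matching step of the test phase can be illustrated with a minimal sketch. It assumes that the query document and the support vectors have already been projected into the same reduced LSI space, and it is not the authors' implementation. The function names, the toy vectors, the threshold value of 0.7, the use of the maximum similarity as the match score, and the rule that a high score means "negative" are all assumptions made for illustration; the abstract specifies only that cosine similarity and a prespecified threshold are used.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_score(query: Sequence[float],
                support_vectors: Sequence[Sequence[float]]) -> float:
    """Assumed aggregation: the highest cosine similarity to any support vector."""
    return max(cosine_similarity(query, sv) for sv in support_vectors)


def classify(query: Sequence[float],
             support_vectors: Sequence[Sequence[float]],
             threshold: float) -> str:
    """Assumed decision rule: a query that closely matches a negative-class
    support vector is labeled negative; otherwise it is labeled positive."""
    if match_score(query, support_vectors) >= threshold:
        return "negative"
    return "positive"


# Hypothetical support vectors from the negative class, in a 3-dimensional LSI space.
negative_svs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]

near_query = [0.9, 0.1, 0.0]  # points in nearly the same direction as the first SV
far_query = [0.0, 0.0, 1.0]   # orthogonal to both SVs

print(classify(near_query, negative_svs, threshold=0.7))
print(classify(far_query, negative_svs, threshold=0.7))
```

Because each query is compared only with the support vectors, the cost per query grows with the number of SVs rather than with the size of the full training set, which is the saving the abstract describes.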
