A Hybrid Approach Using Topic Modeling and Class-Association Rule Mining for Text Classification: the Case of Malware Detection

We propose a novel, general-purpose hybrid method for text classification that applies topic modeling and Class Association Rule Mining (CARM) in tandem. Topic modeling performs dimension reduction, while the association rule mining is carried out by the Apriori and Frequent Pattern (FP)-growth algorithms, separately. To illustrate the effectiveness of the proposed method, we perform malware prediction on two publicly available datasets of API calls. The proposed model generates highly accurate class association rules and a high Area Under the Curve (AUC) compared to the extant models in the literature. A statistical significance test indicates that the two proposed hybrid models, i.e., topic modeling with FP-growth and topic modeling with Apriori, perform equally well.
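The rule-mining half of such a pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the topic-modeling step has already reduced each API-call trace to a small set of dominant topics, and the item names (`topic_A` and so on), the toy labels, the thresholds, and the function `mine_class_rules` are all hypothetical. It uses a plain level-wise, Apriori-style search and keeps rules of the form itemset -> class that pass minimum support and confidence.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(transactions, labels, min_support=0.3,
                     min_confidence=0.8, max_len=2):
    """Level-wise (Apriori-style) search for rules of the form itemset -> class."""
    n = len(transactions)
    rules = []
    # Level 1 candidates: every single item that appears in the data.
    candidates = sorted({frozenset([item]) for t in transactions for item in t},
                        key=sorted)
    size = 1
    while candidates and size <= max_len:
        frequent = []
        for itemset in candidates:
            # Class labels of the transactions that contain this itemset.
            class_counts = Counter(
                lab for t, lab in zip(transactions, labels) if itemset <= t
            )
            covered = sum(class_counts.values())
            if covered / n < min_support:
                continue  # infrequent itemset: its supersets are infrequent too
            frequent.append(itemset)
            for lab, count in class_counts.items():
                support = count / n           # fraction of all records matching the rule
                confidence = count / covered  # fraction of covered records in this class
                if support >= min_support and confidence >= min_confidence:
                    rules.append((tuple(sorted(itemset)), lab, support, confidence))
        # Join frequent k-itemsets into (k+1)-item candidates for the next level.
        size += 1
        candidates = sorted(
            {a | b for a, b in combinations(frequent, 2) if len(a | b) == size},
            key=sorted,
        )
    return rules


# Hypothetical toy data: each record lists the dominant topics of one sample.
transactions = [
    frozenset({"topic_A", "topic_B"}),
    frozenset({"topic_A", "topic_B"}),
    frozenset({"topic_A", "topic_C"}),
    frozenset({"topic_C"}),
    frozenset({"topic_C", "topic_D"}),
    frozenset({"topic_B", "topic_D"}),
]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]

rules = mine_class_rules(transactions, labels)
for items, label, support, confidence in rules:
    print(f"{items} -> {label} (support={support:.2f}, confidence={confidence:.2f})")
```

An FP-growth variant would mine the same frequent itemsets without generating candidates level by level, so the resulting rules should coincide for the same thresholds; only the search strategy differs.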
