Expert Systems With Applications

Abstract: Code obfuscation is a staple of malware creation: code fragments are altered substantially so that they appear different from the original while their semantics remain unchanged. Most obfuscated code detection methods use program structure as a signature for detecting unknown code. They usually ignore the most important feature, the semantics of the code, when matching two code fragments or programs. Obfuscated code detection is a special case of semantic code clone detection. We propose a machine learning framework that detects both obfuscated code and code clones. It uses features extracted from Java bytecode dependency graphs (BDGs), program dependency graphs (PDGs), and abstract syntax trees (ASTs). BDGs and PDGs represent the semantics of a Java program, while ASTs capture its structure. We validate the framework on several publicly available code clone and obfuscated code datasets and evaluate detection quality with several assessment measures. The experimental results compare favorably with those of contemporary obfuscated code and code clone detectors. In particular, we achieve 100% recall, precision, and F1-score in detecting obfuscated code. Across all obfuscation types considered, namely contraction, expansion, loop transformation, and renaming, our model outperforms the methods we compare against. For clone detection, our model achieves very high detection accuracy in comparison with similar detectors.
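The evaluation measures named in the abstract (recall, precision, and F1-score) can be illustrated with a minimal sketch. It assumes a binary labeling in which 1 marks a pair of fragments as an obfuscated or cloned match and 0 marks a non-match; the function name `precision_recall_f1` and the sample labels are hypothetical and are not taken from the paper or its datasets.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against empty denominators by returning 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels for six fragment pairs (illustration only).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A detector reaches 100% on all three measures only when it produces no false positives and no false negatives, that is, when `y_pred` equals `y_true` and at least one positive pair is present.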
