Cross-Language Binary-Source Code Matching with Intermediate Representations

Binary-source code matching plays an important role in many security and software engineering tasks, such as malware detection, reverse engineering, and vulnerability assessment. Several approaches have been proposed for binary-source code matching that jointly learn embeddings of binary code and source code in a common vector space. However, existing approaches only match binary code against source code written in a single programming language. In practice, software applications are often written in several programming languages to meet different requirements and target different computing platforms, and matching binary and source code across programming languages introduces additional challenges when maintaining such multi-language, multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for it. We present a novel approach, XLIR, a Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, we conduct comprehensive experiments on our curated dataset for two tasks: cross-language binary-source code matching and cross-language source-source code matching. Experimental results and analysis show that XLIR with intermediate representations significantly outperforms other state-of-the-art models on both tasks.
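To make the idea of matching in a common vector space concrete, the following is a minimal sketch of the retrieval step only. It is not the XLIR model: the three-dimensional vectors are hand-made stand-ins for embeddings that a trained encoder would produce, and the file names, `cosine`, and `rank_candidates` are hypothetical names introduced for illustration. The choice of cosine similarity is also an assumption, since the abstract does not name the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Return (name, score) pairs sorted by descending similarity to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed, not learned): one binary function as the query,
# and source functions in different languages as candidates.
binary_query = [0.9, 0.1, 0.0]
source_candidates = {
    "sort.java": [0.8, 0.2, 0.1],
    "hash.c": [0.0, 0.9, 0.4],
    "parse.cpp": [0.1, 0.0, 1.0],
}

ranking = rank_candidates(binary_query, source_candidates)
best_name = ranking[0][0]
print(best_name)
```

Because both modalities live in the same space, the language of each candidate plays no role at query time; only the distance between vectors does.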
