Semantic Code Search for Smart Contracts

Semantic code search technology allows searching for existing code snippets through natural language, which can greatly improve programming efficiency. Smart contracts, programs that run on the blockchain, have a code reuse rate of more than 90%, which means developers have a great demand for semantic code search tools. However, the existing code search models still have a semantic gap between code and query, and perform poorly on specialized queries of smart contracts. In this paper, we propose a Multi-Modal Smart contract Code Search (MM-SCS) model. Specifically, we construct a Contract Elements Dependency Graph (CEDG) for MM-SCS as an additional modality to capture the data-flow and control-flow information of the code. To make the model more focused on the key contextual information, we use a multi-head attention network to generate embeddings for code features. In addition, we use a finetuned pretrained model to ensure the model’s effectiveness when the training data is small. We compared MM-SCS with four stateof-the-art models on a dataset with 470K (code, docstring) pairs collected from Github and Etherscan. Experimental results show that MM-SCS achieves an MRR (Mean Reciprocal Rank) of 0.572, outperforming four state of-the-art models UNIF, DeepCS, CARLCSCNN, and TAB-CS by 34.2%, 59.3%, 36.8%, and 14.1%, respectively. Additionally, the search speed of MM-SCS is second only to UNIF, reaching 0.34s/query.

[1]  Marc Brockschmidt,et al.  CodeSearchNet Challenge: Evaluating the State of Semantic Code Search , 2019, ArXiv.

[2]  Rabab Kreidieh Ward,et al.  Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3]  Torsten Hoefler,et al.  Neural Code Comprehension: A Learnable Representation of Code Semantics , 2018, NeurIPS.

[4]  Koushik Sen,et al.  When deep learning met code search , 2019, ESEC/SIGSOFT FSE.

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Xin Xia,et al.  Improving Code Search with Co-Attentive Representation Learning , 2020, 2020 IEEE/ACM 28th International Conference on Program Comprehension (ICPC).

[7]  Jure Leskovec,et al.  node2vec: Scalable Feature Learning for Networks , 2016, KDD.

[8]  Zhi Jin,et al.  Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree , 2020, 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[9]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[10]  Razvan C. Bunescu,et al.  Learning to rank relevant files for bug reports using domain knowledge , 2014, SIGSOFT FSE.

[11]  Venkata Subramaniam,et al.  Information Retrieval: Data Structures & Algorithms , 1992 .

[12]  Konrad Rieck,et al.  Modeling and Discovering Vulnerabilities with Code Property Graphs , 2014, 2014 IEEE Symposium on Security and Privacy.

[13]  Fazli Can,et al.  Information Retrieval Data Structures & Algorithms, by William B. Frakes and Ricardo Baeza-Yates (Book Review) , 1993, SIGIR Forum.

[14]  Xuan Li,et al.  Relationship-aware code search for JavaScript frameworks , 2016, SIGSOFT FSE.

[15]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[16]  Zhou Xu,et al.  Two-Stage Attention-Based Model for Code Search with Textual and Structural Features , 2021, 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).

[17]  Xiaodong Gu,et al.  Deep Code Search , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[18]  Manohar Kaul,et al.  Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs , 2019, ACL.

[19]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[20]  Kevin Gimpel,et al.  ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.

[21]  Dongmei Zhang,et al.  CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[22]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[23]  Steven Skiena,et al.  DeepWalk: online learning of social representations , 2014, KDD.

[24]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[25]  Koushik Sen,et al.  Retrieval on source code: a neural code search , 2018, MAPL@PLDI.