ICSD: An Automatic System for Insecure Code Snippet Detection in Stack Overflow over Heterogeneous Information Network

As modern social coding platforms such as Stack Overflow grow in popularity, so do their potential security risks (e.g., insecure code can easily be embedded and distributed). To address this largely overlooked issue, we propose exploiting social coding properties in addition to code content for the automatic detection of insecure code snippets in Stack Overflow. To determine whether a given code snippet is insecure, we not only analyze its code content but also utilize the various relations among users, badges, questions, answers, code snippets, and keywords in Stack Overflow. To model these rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) for representation and then use a meta-path based approach to incorporate higher-level semantics and build relatedness over code snippets. We then propose a novel network embedding model named snippet2vec for representation learning in the HIN, in which both the HIN structure and semantics are maximally preserved. Finally, a multi-view fusion classifier is constructed for insecure code snippet detection. To the best of our knowledge, this is the first work to utilize both code content and social coding properties to address code security issues in modern software coding platforms. Comprehensive experiments on data collected from Stack Overflow validate the effectiveness of the developed system, ICSD, which integrates our proposed method, in comparison with alternative approaches.
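
To make the idea of meta-path based relatedness over code snippets concrete, the following is a minimal sketch under stated assumptions, not the ICSD implementation. It builds a toy snippet-keyword relation, counts path instances along the symmetric meta-path snippet-keyword-snippet, and normalizes the counts with a PathSim-style formula. The snippet names, keywords, edges, helper functions, and the choice of PathSim normalization are all hypothetical illustrations; the abstract does not specify them.

```python
# Toy sketch: relatedness between code snippets along one meta-path in a HIN.
# All entity names and edges below are made-up illustrations, not paper data.


def matmul(a, b):
    """Multiply two dense matrices given as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Return the transpose of a dense matrix."""
    return [list(row) for row in zip(*m)]


snippets = ["s0", "s1", "s2"]          # hypothetical code snippets
keywords = ["TrustManager", "MD5"]     # hypothetical keywords

# contains[i][j] = 1 if snippet i contains keyword j (assumed edges).
contains = [
    [1, 0],  # s0 mentions TrustManager
    [1, 1],  # s1 mentions both keywords
    [0, 1],  # s2 mentions MD5
]


def metapath_counts(adj):
    """Count path instances of the symmetric meta-path snippet-X-snippet."""
    return matmul(adj, transpose(adj))


def normalized_relatedness(counts):
    """PathSim-style normalization: 2 * c[i][j] / (c[i][i] + c[j][j])."""
    n = len(counts)
    return [[(2 * counts[i][j] / (counts[i][i] + counts[j][j]))
             if counts[i][i] + counts[j][j] else 0.0
             for j in range(n)]
            for i in range(n)]


counts = metapath_counts(contains)
sim = normalized_relatedness(counts)

for name, row in zip(snippets, sim):
    print(name, [round(v, 3) for v in row])
```

In this toy network, s0 and s2 share no keyword and so have zero relatedness along this meta-path, while s1 is related to both. Other meta-paths (for example, through answers and users) would use different relation matrices in the same way, each giving a separate view of snippet relatedness.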
