Cyber-guided Deep Neural Network for Malicious Repository Detection in GitHub

As the largest source code repository, GitHub plays a vital role in the modern social coding ecosystem for producing production software. Despite the apparent benefits of this social coding paradigm, its potential security risks have been largely overlooked (e.g., malicious code or repositories can be easily embedded and distributed). To address this pressing issue, in this paper we propose a novel framework (named GitCyber) as a first attempt to automate malicious repository detection in GitHub. In GitCyber, we first extract code contents from the repositories hosted in GitHub as the inputs for a deep neural network (DNN). We then incorporate cybersecurity domain knowledge, modeled by a heterogeneous information network (HIN), to design a cyber-guided loss function in the learning objective of the DNN, so as to assure classification performance while preserving consistency with the observational domain knowledge. Comprehensive experiments on large-scale data collected from GitHub demonstrate that the proposed GitCyber outperforms state-of-the-art methods in malicious repository detection.
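
The abstract does not give the form of the cyber-guided loss, so the following is only a minimal sketch of the general idea of a knowledge-guided objective, not GitCyber's actual formulation. It assumes a binary classifier, a standard cross-entropy term, and a hypothetical consistency penalty that discourages different predictions for repository pairs that the domain knowledge marks as related. The function names, the `similarity` dictionary, and the weight `lam` are all illustrative assumptions.

```python
import math

def cross_entropy(probs, labels):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def consistency_penalty(probs, similarity):
    """Similarity-weighted mean squared gap between predictions of related pairs.

    `similarity` maps an index pair (i, j) to a non-negative relatedness score.
    Here it stands in (as an assumption) for knowledge derived from an HIN.
    """
    total, weight = 0.0, 0.0
    for (i, j), s in similarity.items():
        total += s * (probs[i] - probs[j]) ** 2
        weight += s
    return total / weight if weight else 0.0

def guided_loss(probs, labels, similarity, lam=0.5):
    """Classification loss plus a weighted domain-knowledge consistency term."""
    return cross_entropy(probs, labels) + lam * consistency_penalty(probs, similarity)

# Hypothetical example: three repositories; repositories 0 and 1 are strongly related.
probs = [0.9, 0.1, 0.2]
labels = [1, 1, 0]
similarity = {(0, 1): 1.0}
loss = guided_loss(probs, labels, similarity, lam=0.5)
```

In this sketch, setting `lam` to zero recovers plain cross-entropy, while larger values push related repositories toward the same predicted label. How the paper balances the two terms is not stated in the abstract.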
