LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs

Software repository hosting services contain large amounts of open-source software; GitHub alone hosts more than 100 million repositories, ranging from new to established projects. Given this vast number of projects, there is a pressing need for search based on the software’s content and features. However, although GitHub offers various solutions to aid software discovery, most repositories have no labels, which reduces the utility of search and topic-based analysis. Moreover, classifying software modules is gaining importance with the rise of Component-Based Software Development. Previous work has focused on software classification using keyword-based approaches or proxies for the project (e.g., the README), which are not always available. In this work, we create a new annotated dataset of GitHub Java projects called LabelGit. Our dataset uses information taken directly from the source code, such as the dependency graph and neural representations of the source code computed from its identifiers. With this dataset, we hope to aid the development of solutions that do not rely on proxies but use the entire source code to perform classification.
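To make the idea of an attributed dependency graph concrete, the following is a minimal sketch and not the dataset's actual schema. It assumes nodes are Java classes, edges are "depends on" relations, and each node carries a feature vector derived from its identifiers. The names `AttributedDependencyGraph` and `toy_identifier_embedding` are hypothetical, and the character-hashing embedding is only a stand-in for a learned neural representation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AttributedDependencyGraph:
    """Directed graph whose nodes carry feature vectors (hypothetical structure)."""

    features: dict[str, list[float]] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_node(self, name: str, vector: list[float]) -> None:
        # Store a copy so later changes to the caller's list do not leak in.
        self.features[name] = list(vector)

    def add_dependency(self, source: str, target: str) -> None:
        # An edge (source, target) means "source depends on target".
        if source not in self.features or target not in self.features:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((source, target))

    def dependencies_of(self, name: str) -> list[str]:
        return sorted(t for s, t in self.edges if s == name)


def toy_identifier_embedding(identifiers: list[str], dim: int = 4) -> list[float]:
    """Stand-in for a neural embedding: buckets character codes, then normalizes."""
    vec = [0.0] * dim
    for ident in identifiers:
        for ch in ident.lower():
            vec[ord(ch) % dim] += 1.0
    total = sum(vec) or 1.0  # avoid division by zero for empty input
    return [v / total for v in vec]


# Toy project with two classes; the class and identifier names are invented.
graph = AttributedDependencyGraph()
graph.add_node("app.Main", toy_identifier_embedding(["main", "args"]))
graph.add_node("app.Parser", toy_identifier_embedding(["parse", "token"]))
graph.add_dependency("app.Main", "app.Parser")
print(graph.dependencies_of("app.Main"))
```

A classifier working on such a structure would see both the graph topology and the per-node vectors, rather than a proxy such as the README.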
