Automatic Classification of Software Artifacts in Open-Source Applications

As open-source software development has grown in popularity, so has the volume of software artifacts that offer insight into how people build software. Researchers need large-scale, representative collections of such artifacts to validate novel and existing techniques systematically and without bias. For example, in software requirements traceability, researchers often rely on applications that contain multiple types of artifacts, such as requirements, system elements, verifications, or tasks, to develop and evaluate their traceability analysis techniques. However, manually identifying projects with rich sets of artifacts is very labor-intensive. In this work, we first conduct a large-scale study to identify which types of software artifacts a wide variety of open-source projects produce, at different levels of granularity. We then propose an automated approach, based on machine learning techniques, for identifying various types of software artifacts. Through a set of experiments, we report and compare the performance of the machine learning algorithms when applied to software artifacts.
