Mining the Technical Roles of GitHub Users

Abstract

Context: Modern software development demands high levels of technical specialization. As a result, IT companies focus on building cross-functional teams that combine specialists such as frontend, backend, and mobile developers. In this context, the success of software projects is highly influenced by the expertise of these teams in each field. Objective: In this paper, we investigate machine-learning-based approaches to automatically identify the technical roles of open source developers. Method: We first build a ground truth of 2,284 developers labeled with six different roles: backend, frontend, full-stack, mobile, devops, and data science. We then build three different machine-learning models to identify these roles. Results: These models presented competitive results for precision (0.88) and AUC (0.89) when identifying all six roles. Moreover, our results show that programming languages are the most relevant features for predicting the investigated roles. Conclusion: The approach proposed in this paper can assist companies during their hiring process, for example by recommending developers with the expertise required by job positions.
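To make the evaluation setup concrete, the following is a minimal sketch of how precision could be computed over the six roles. It assumes that a developer may hold more than one role at once (a multi-label setting), which the abstract does not state explicitly. The function names and the four toy developers are hypothetical and are not taken from the paper's ground truth or models.

```python
# Illustrative sketch only: toy labels, not the paper's data or models.
ROLES = ["backend", "frontend", "full-stack", "mobile", "devops", "data science"]


def to_indicator(labels):
    """Encode a set of role names as a 0/1 vector ordered like ROLES."""
    return [1 if role in labels else 0 for role in ROLES]


def precision_per_role(y_true, y_pred):
    """Precision for each role: true positives / predicted positives.

    Returns None for a role that was never predicted.
    """
    result = {}
    for j, role in enumerate(ROLES):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        predicted = sum(p[j] for p in y_pred)
        result[role] = tp / predicted if predicted else None
    return result


def micro_precision(y_true, y_pred):
    """Precision pooled over every (developer, role) pair."""
    tp = sum(t[j] & p[j] for t, p in zip(y_true, y_pred) for j in range(len(ROLES)))
    predicted = sum(sum(p) for p in y_pred)
    return tp / predicted if predicted else None


# Hypothetical ground truth and predictions for four developers.
truth = [{"backend"}, {"frontend"}, {"mobile"}, {"backend", "devops"}]
predicted = [{"backend"}, {"frontend", "backend"}, {"mobile"}, {"devops"}]

y_true = [to_indicator(labels) for labels in truth]
y_pred = [to_indicator(labels) for labels in predicted]

print(precision_per_role(y_true, y_pred))
print(micro_precision(y_true, y_pred))
```

Reporting precision per role alongside a pooled value helps show whether a single aggregate figure hides weak performance on roles with few labeled developers.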
