Profiling Software Developers with Process Mining and N-Gram Language Models

Email addresses: jcppc@iscte-iul.pt (João Caldeira), fba@iscte-iul.pt (Fernando Brito e Abreu), jcardoso@dei.uc.pt (Jorge Cardoso), ricardo.ribeiro@iscte-iul.pt (Ricardo Ribeiro), werner@cos.ufrj.br (Claudia Werner)

Preprint submitted to Journal of Systems and Software, January 19, 2021. arXiv:2101.06733v1 [cs.SE], 17 Jan 2021.

Context: Profiling developers is challenging because many factors, such as their skills, experience, development environment, and behaviors, may influence a detailed analysis and the delivery of coherent interpretations.

Objective: We aim to profile software developers by mining their software development process. To do so, we performed a controlled experiment in which, in the context of a Python programming contest, a group of developers worked from the same well-defined requirements specification and followed the same well-defined sprint schedule. Events were collected from the PyCharm IDE and from the Mooshak automatic jury, where subjects checked in their code.

Method: We used n-gram language models and text mining to characterize developers’ profiles, and process mining algorithms to discover their overall workflows and extract the corresponding metrics for further evaluation.

Results: Findings show that we can characterize most developers with a coherent rationale and distinguish the top performers from those with more challenging behaviors. This approach may ultimately lead to the creation of a catalog of software development process smells.

Conclusions: A developer’s profile gives a software project manager a clue for selecting the tasks that developer should be assigned. With the increasing use of low-code and no-code platforms, where code is automatically generated from a higher abstraction layer, mining developers’ actions in the development platforms is a promising approach not only to detect behaviors early but also to assess project complexity and model effort.
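To make the n-gram idea from the Method concrete, the following is a minimal sketch, not the authors' implementation. It assumes each developer session has already been reduced to a sequence of event labels; the labels used here ("edit", "run", "debug", "submit") and the choice of a bigram model with add-one smoothing are illustrative assumptions. A sequence that resembles the training sessions should receive a lower perplexity than one that deviates from them.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # sequence boundary markers (illustrative choice)


def train_bigram(sequences):
    """Count bigrams and their left contexts over a list of event sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = [START] + list(seq) + [END]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def perplexity(seq, model):
    """Perplexity of one event sequence under an add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    padded = [START] + list(seq) + [END]
    v = len(vocab)
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one (Laplace) smoothing keeps unseen transitions above zero.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))


# Hypothetical sessions; real input would come from collected IDE events.
training_sessions = [
    ["edit", "run", "submit"],
    ["edit", "run", "edit", "run", "submit"],
]
model = train_bigram(training_sessions)

typical = perplexity(["edit", "run", "submit"], model)
atypical = perplexity(["debug", "debug", "submit"], model)
print(f"typical: {typical:.3f}  atypical: {atypical:.3f}")
```

In a sketch like this, perplexity serves as a rough distance between one developer's event stream and a reference model, which is one plausible way to separate developers whose behavior follows the common pattern from those who diverge from it.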
