Detecting and Characterizing Bots that Commit Code

Background: Some developer activity traditionally performed manually, such as committing code or opening, managing, and closing issues, is increasingly automated in many OSS projects. Such activity is often performed by tools that react to events or run at specific times. We refer to these automation tools as bots. In many software mining scenarios related to developer productivity or code quality, it is desirable to identify bots so that their actions can be separated from those of individuals. Aim: To find an automated way of identifying bots and the code they commit, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach that detects bots using author names, commit messages, the files modified by each commit, and the projects associated with the commits. On our test data, the AUC-ROC was 0.9. We also characterized these bots by the time patterns of their code commits and the types of files they modified, and found that they primarily work with documentation files and web pages, and that these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about the 461 bots we found (each with more than 1000 commits) and the 13,762,430 commits they created.
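To make the idea of combining author-name and commit-message signals concrete, the following is a minimal Python sketch. It is not BIMAN's implementation. The name pattern, the digit-masking heuristic, the equal weighting, and all function names are assumptions chosen for illustration. The last function computes AUC-ROC directly from its pairwise-ranking definition, which is how a score like this could be evaluated against labeled authors.

```python
import re
from itertools import product

# Assumption: "bot" appearing as a standalone token in the author name
# hints at automation (e.g. "dependabot[bot]", "travis-bot").
BOT_TOKEN = re.compile(r"(?:^|[^a-z])bot(?:[^a-z]|$)")


def name_looks_like_bot(author):
    """Return True if the lowercased author name contains 'bot' as a token."""
    return BOT_TOKEN.search(author.lower()) is not None


def template_ratio(messages):
    """Fraction of commit messages that duplicate another one after digits
    are masked. Assumption: bots tend to reuse message templates."""
    if not messages:
        return 0.0
    masked = [re.sub(r"\d+", "<n>", m.strip().lower()) for m in messages]
    return 1.0 - len(set(masked)) / len(masked)


def bot_score(author, messages):
    """Hypothetical score in [0, 1]: equal-weight mix of the two signals."""
    return 0.5 * name_looks_like_bot(author) + 0.5 * template_ratio(messages)


def auc_roc(scores, labels):
    """AUC-ROC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))
```

For example, `bot_score("dependabot[bot]", ["Bump lodash to 4.17.20", "Bump lodash to 4.17.21"])` scores higher than `bot_score("Abbot Smith", ["Fix parser bug", "Add tests"])`, because the second name contains "bot" only inside a longer word and the messages share no template. A real detector would also need the file and project signals mentioned in the abstract, which this sketch omits.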
