Identifying bot activity in GitHub pull request and issue comments

Development bots are used on GitHub to automate repetitive activities. Such bots communicate with human actors via issue comments and pull request (PR) comments. Identifying bot comments helps prevent bias in socio-technical studies of software development. To automate this identification, we propose a classification model based on natural language processing. Starting from a balanced ground-truth dataset of 19,282 PR and issue comments, we encode the comments as vectors using a combination of the bag-of-words and TF-IDF techniques. We train a range of binary classifiers to predict the type of comment (human or bot) from this vector representation. A multinomial Naive Bayes classifier provides the best results: on a test set containing 50% of the data, it achieves an average precision, recall, and F1 score of 0.88. Although the model shows promising results on PR and issue comments, further work is required to generalize it to other types of activity, such as commit messages and code reviews.
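
The pipeline named in the abstract (bag of words, TF-IDF weighting, multinomial Naive Bayes) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the class name `TfidfNaiveBayes`, the tokenizer, the smoothed IDF formula, the Laplace smoothing parameter `alpha`, and the six toy comments are all assumptions made for illustration, and the toy comments are invented rather than taken from the ground-truth dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfNaiveBayes:
    """Hypothetical multinomial Naive Bayes over TF-IDF-weighted word counts."""

    def fit(self, comments, labels, alpha=1.0):
        docs = [Counter(tokenize(c)) for c in comments]  # bag of words per comment
        n_docs = len(docs)
        doc_freq = Counter(term for doc in docs for term in doc)
        # Smoothed inverse document frequency (one common variant, assumed here).
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
                    for t, df in doc_freq.items()}
        self.vocab = set(self.idf)
        self.classes = sorted(set(labels))
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(class_counts[c] / n_docs)
                          for c in self.classes}
        # Accumulate the TF-IDF mass of every term within each class.
        weight = {c: defaultdict(float) for c in self.classes}
        for doc, label in zip(docs, labels):
            for term, count in doc.items():
                weight[label][term] += count * self.idf[term]
        # Laplace-smoothed log-likelihood of each term given the class.
        self.log_likelihood = {}
        for c in self.classes:
            total = sum(weight[c].values()) + alpha * len(self.vocab)
            self.log_likelihood[c] = {
                t: math.log((weight[c][t] + alpha) / total) for t in self.vocab
            }
        return self

    def predict(self, comment):
        doc = Counter(tokenize(comment))
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for term, count in doc.items():
                if term in self.vocab:  # terms unseen in training are ignored
                    score += count * self.idf[term] * self.log_likelihood[c][term]
            scores[c] = score
        return max(scores, key=scores.get)


# Invented toy comments, for illustration only.
train_comments = [
    "Coverage increased by 0.5% when pulling this branch into master",
    "This pull request has been automatically marked as stale",
    "Build succeeded. Coverage report attached automatically",
    "I think this breaks on Windows, can you take a look?",
    "Thanks for the fix, looks good to me",
    "Can you add a test for this case? I think it is missing",
]
train_labels = ["bot", "bot", "bot", "human", "human", "human"]

clf = TfidfNaiveBayes().fit(train_comments, train_labels)
print(clf.predict("Coverage decreased automatically on this pull request"))
print(clf.predict("I think you can add a test for the fix"))
```

On six comments the classifier only separates obvious vocabulary; the reported scores of 0.88 come from the full 19,282-comment dataset and the authors' own setup, which this sketch does not reproduce.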
