Bot or not?: Detecting bots in GitHub pull request activity based on comment similarity

Many empirical studies focus on socio-technical activity in social coding platforms such as GitHub, for example to study onboarding, abandonment, productivity and collaboration among team members. Such studies face the difficulty that GitHub activity can also be generated automatically by bots of various kinds. It is therefore essential to distinguish such bots from human users. We propose an automated approach to detect bots in GitHub pull request (PR) activity. Relying on the assumption that bots produce repetitive message patterns in their PR comments, we analyse the similarity between multiple messages from the same GitHub identity, using a clustering method that combines the Jaccard and Levenshtein distances. We empirically evaluate our approach by analysing 20,090 PR comments of 250 users and 42 bots in 1,262 GitHub repositories. Our results show that the method is able to clearly separate bots from human users.
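
To make the idea of comment similarity concrete, the following is a minimal sketch, not the implementation evaluated in the paper. The abstract does not say how the two distances are combined or which clustering algorithm is used, so several choices here are assumptions made purely for illustration: the Jaccard distance is computed on whitespace-separated tokens, the Levenshtein distance is normalised by the length of the longer comment, the two are averaged, and comments are grouped by a simple threshold-based single-linkage rule. The function names and the threshold of 0.5 are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance on lower-cased whitespace tokens (tokenisation is an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def normalised_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def combined_distance(a: str, b: str) -> float:
    """Assumed combination: the plain average of the two distances."""
    return (jaccard_distance(a, b) + normalised_levenshtein(a, b)) / 2


def count_clusters(comments: list[str], threshold: float = 0.5) -> int:
    """Group comments whose combined distance is <= threshold (single linkage)
    and return the number of groups. The threshold is a hypothetical value."""
    parent = list(range(len(comments)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(comments)), 2):
        if combined_distance(comments[i], comments[j]) <= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(comments))})
```

Under these assumptions, an identity whose comments follow a template (for instance, build notifications that differ only in a number) would collapse into very few clusters relative to its number of comments, whereas an identity writing varied free-form comments would yield roughly one cluster per comment. The ratio of clusters to comments is one possible signal for separating the two.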
