Massive Multi-agent Data-Driven Simulations of the GitHub Ecosystem

Simulating and predicting planetary-scale techno-social systems poses heavy computational and modeling challenges. The DARPA SocialSim program set the challenge of modeling the evolution of GitHub, a large collaborative software-development ecosystem, using massive multi-agent simulations. We describe our best-performing models and our agent-based simulation framework, which we are currently extending to simulate other planetary-scale techno-social systems. The challenge problem measured participants’ ability, given 30 months of metadata on user activity on GitHub, to predict the next months’ activity using agent-based simulation, with predictions scored against ground truth on a broad range of metrics. The challenge required scaling to a simulation of roughly 3 million agents producing a combined 30 million actions on 6 million repositories, all on commodity hardware. It was also important to use the data optimally to predict the agents’ next moves. We describe the agent framework and the data analysis employed by one of the winning teams in the challenge. Six different agent models were tested, based on a variety of machine learning and statistical methods. While no single method proved most accurate on every metric, the most broadly successful agents sampled from a stationary probability distribution of actions and repositories for each agent. Two factors explain the success of these agents: each agent was characterized individually, and GitHub users change their behavior relatively slowly.
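
As an illustration of the most broadly successful agent type described above, the following is a minimal sketch of an agent that samples from a per-agent stationary distribution over (action, repository) pairs. It is not the authors' implementation: the function names `fit_agents` and `simulate`, the `(user, action, repo)` event tuple layout, the `scale` parameter, and the choice to fix each agent's event volume to its training-period count are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_agents(events):
    """Build one empirical (action, repo) distribution per user.

    `events` is an iterable of (user, action, repo) tuples taken from the
    training period. The tuple layout is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for user, action, repo in events:
        counts[user][(action, repo)] += 1
    return counts


def simulate(counts, rng, scale=1.0):
    """Draw simulated events for every agent.

    Each agent emits round(scale * its training event count) events,
    sampled independently from its own empirical distribution. Keeping
    the distribution fixed over time is what makes it "stationary".
    """
    simulated = []
    for user, pair_counts in counts.items():
        pairs = list(pair_counts.keys())
        weights = list(pair_counts.values())
        n_events = round(sum(weights) * scale)
        for action, repo in rng.choices(pairs, weights=weights, k=n_events):
            simulated.append((user, action, repo))
    return simulated


# Toy training data; user and repository names are hypothetical.
training = [
    ("alice", "PushEvent", "alice/tool"),
    ("alice", "PushEvent", "alice/tool"),
    ("alice", "WatchEvent", "bob/lib"),
    ("bob", "IssuesEvent", "bob/lib"),
]
model = fit_agents(training)
predicted = simulate(model, random.Random(0))
```

Because each agent only ever replays pairs it has already produced, a model of this kind cannot predict activity on repositories an agent has never touched; it relies on the observation that user behavior changes slowly.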
