Nalanda: A Socio-Technical Graph for Building Software Analytics Tools at Enterprise Scale

Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and files. As the speed of development increases, information overload becomes a challenge for the people developing and maintaining software systems. In this paper, we build a large-scale socio-technical graph to address the challenges of information overload and discovery, with an initial focus on artifacts central to the software development and delivery process. The Nalanda graph is an enterprise-scale graph built from the data of 6,500 repositories, with 37,410,706 nodes and 128,745,590 edges. On top of this graph, we built software analytics applications, including a newsfeed named MyNalanda; based on organic growth alone, MyNalanda has 290 Daily Active Users (DAU) and 590 Monthly Active Users (MAU). A preliminary user study shows that 74% of the developers and engineering managers surveyed favor continued use of the platform for information discovery. This work provides a view into a new large-scale socio-technical graph and the technical choices behind it, the implications for information discovery and overload among developers and managers, and the implications of future development on the Nalanda graph.
