The Open-Closed Principle of Modern Machine Learning Frameworks

Recent advances in computing technologies and the availability of huge volumes of data have sparked a new machine learning (ML) revolution, where almost every day a new headline touts ML models surpassing human experts on some task. Open source software development is rumoured to play a significant role in this revolution, with both academics and large corporations such as Google and Microsoft releasing their ML frameworks under an open source license. This paper takes a step back to understand the role of open source development in modern ML by examining the growth of the open source ML ecosystem on GitHub, its actors, and the adoption of frameworks over time. By mining LinkedIn and Google Scholar profiles, we also examine the driving factors behind this growth (paid vs. voluntary contributors), the major players who promote its democratization (companies vs. communities), and the composition of ML development teams (engineers vs. scientists). In terms of the technology adoption lifecycle, we find that ML is between the early adoption and early majority stages. Furthermore, companies are the main drivers behind open source ML, while the majority of development teams are hybrid teams comprising both engineers and professional scientists. The latter are scientists employed by a company, and they are by far the most active profiles in the development of ML applications, which reflects the importance of a scientific background, complementing coding skills, for the development of ML frameworks. The large influence of cloud computing companies on the development of open source ML frameworks raises the risk of vendor lock-in: these frameworks, while open source, could be optimized for specific commercial cloud offerings.
