A Quantitative Profile of a Community of Open Source Linux Developers

Open source software, or free software, has generated much interest and debate in the wake of a number of high-impact applications and systems produced under open source models for development and distribution. Despite this interest, little hard data exists to date on the membership of collaborative open source communities and the evolutionary process of their repositories. This paper contributes a baseline quantitative study of one of the oldest continuous repositories for the Linux open source project (the UNC MetaLab Linux Archives), including demographic information on its broad community of developers. Our methodology is a close examination of collection statistics, including those gathered by custom monitoring scripts on the server, as well as an analysis of the contents of user-generated metadata embedded within the Archives. User-generated metadata files in a format known as the Linux Software Map (LSM) are required when submitting open source software for inclusion in non-mirrored portions of the MetaLab Linux Archives. The more than 4500 LSMs in the Archives thus provide a demographic profile of contributors of LSM-accompanied software, as well as other information on this broad subset of the Linux community. To explore repository evolution directly, an instrumented Linux Archives mirror was developed, and aggregate statistics on content changes seen over a month-long period are reported. In sum, our results quantify aspects of the global Linux development effort in dimensions that have not been documented before, and they provide a guide for more detailed future studies.

Introduction

Open source development communities have successfully created, distributed, and continued to evolve many important software projects: the GNU project’s utilities and libraries, including the gcc compiler and Emacs editor; the Perl and Tcl languages; the Apache WWW server; and the Linux and FreeBSD operating systems.
Open source, or free software, means more than access to source code (see Appendix A), and there is no universal agreement on a single open-source development model. Nonetheless, the guiding principle for open source software is that, by sharing source code, developers cooperate under a model of rigorous peer review and take advantage of “parallel debugging” that leads to innovation and rapid advancement in developing and evolving software products. Open-source licensing, moreover, ensures an open market in integration and support for these products downstream. Software production and distribution driven by the open source model thus has strong practical advantages, as well as a strong appeal to those who, in Richard Stallman’s words, see in open source software a “social advantage, allowing users to cooperate, and an ethical advantage, respecting the user’s freedom” [3].

Advocates emphasizing the business reasons for adopting an open source model have engendered in recent years an ongoing and often acrimonious debate over the ultimate impact of open source communities. Some have proposed that free software methods leveraging the Internet represent an alternative economic model for engendering and managing robust software, one that will dramatically reshape the multibillion-dollar commercial software industry. Skeptics meanwhile continue to challenge the idea that the technical and organizational approach represented by open-source development can really scale up in the coming years and produce the robust software required for large-scale mainstream computing [1]. The stakes in this debate are clearly quite high.

1 See http://metalab.unc.edu/osrt/ for related work by the Open Source Research Team.

Dempsey, Weiss, Jones, and Greenberg 12/6/99 SILS Technical Report TR-1999-05 Page 2
A prime difficulty in understanding and drawing conclusions about open source collaborative development has been the sketchy information available on exactly who participates in open source development and how their software archives evolve. This lack of information is understandable given the distributed, organic process of collaborative development in open source communities. The contribution of this paper is a baseline quantitative study of a broad community of developers within the Linux open source effort, which, due to its influence and increasing user base, is widely regarded as a cornerstone project for large-scale open-source development. Our work characterizes a very large repository of Linux-related materials and analyzes information embedded within the collection on the nature of its contributors. Derived from a variety of collection metadata statistics, the data and analysis here support the assertion that the Linux community is indeed vibrant, geographically diverse, and engaged in broadening the quantity and scope of the freely available Linux software and documentation.

Background on Open Source Development

The genesis of the open-source model for software development and distribution goes back to the earliest days of software in university environments. Open-source software is an alternative term for “free software”, which was popularized by the seminal Free Software Foundation, founded in 1984 by MIT researcher Richard Stallman. The Free Software Foundation is the parent organization for the GNU (GNU’s Not Unix) project. Stallman’s vision was to develop a free operating system, complete with standard software tools such as compilers, interpreters, text editors, mailers, and so forth, in order to recreate a community of cooperating hackers that he felt had been lost [3]. Under his direction, the Free Software Foundation explained “free software” through the now-classic distinction: free as in “free speech”, not “free beer”.
That is, free software may or may not be distributed at a monetary cost, but the knowledge that underlies the program, i.e., the source code, should be freely available in order to empower future innovation. Software source code is a form of scientific knowledge, and just as scientists publish so that other scientists can build on their results, computer scientists must publish their source code in order to foster continued innovation in computing.

Unfortunately, the term “free software” has negative connotations for many in the commercial computing world, and the tone adopted by Stallman, for some time the most prominent free software advocate, was distinctly anti-business. In early 1997, a group of leaders in the free software community decided to address this problem head-on with a marketing campaign designed to “argue for ‘free software’ on pragmatic grounds of reliability, cost, and strategic business risk” [4]. They were goaded to action largely by frustration over what they felt was the unrecognized potential of free software as a driver of innovation and the basis for the development of commercial-grade software, despite the successes of Apache, Linux, and other projects. An initial decision of the group, which would become the Open Source Initiative, was to choose the term “open source” for their campaign to avoid the baggage carried by the term “free software”.

A key component of Stallman’s effort in developing a successful free software organization was to formulate a licensing agreement that would prevent businesses from taking free software and using it in binary-only redistributions for commercial gain. Stallman developed the GNU General Public License, known as the GPL or “copyleft”, to address this issue.
In subsequent years, other open-source efforts adopted variations on copyright statements designed to enable open-source works to thrive while not hampering the ability of developers to incorporate open-source work effectively. For its part, the Open Source Initiative adopted a set of criteria, titled “The Open Source Definition”, for open-source licensing. Based on an earlier document by Bruce Perens, the Open Source Definition explicitly mentions some example licenses that fit its criteria, including those of the GNU project (the GNU GPL), the Berkeley Unix Project (BSD), the X Consortium, and a few others. For reference, the Open Source Definition, Version 1.7, is reproduced in Appendix A.

Linux: Open-Source Development on a Global Scale

Internet connectivity has enabled the open-source notion of cooperative, peer-reviewed software development to be deployed on a global scale. Perhaps the most influential open-source project to date has been, and continues to be, the Linux operating system. Linux began as a personal project of a graduate student in Finland, Linus Torvalds, in 1991. The Linux project now represents a mature operating system that runs on the popular hardware platforms. Linux is playing an increasingly significant role in the business plans of established computing companies, in university research labs, and in the development of a new set of companies focused on Linux support and integration issues.

2 It is interesting to note that the GNU Emacs FAQ (http://www.gnu.org/software/emacs/emacs-faq.text), dated February 1999, points out: “The real legal meaning of the GNU General Public License (copyleft) will only be known if and when a judge rules on its validity and scope. There has never been a copyright infringement case involving the GPL to set any precedents...”

According to April 1999 statistics at the Internet Operating System Counter site [5], Linux is now the operating system at over 30% of Internet server sites. Linux has been estimated to have 10 percent of the server market share in the Unix market, with growth trends suggesting it will dominate the Unix arena in a few years. The Linux Kernel Project continues to be led by Linus Torvalds himself, with a significant array of co-developers throughout the world. In addition, the Linux community of application-lev