The Ultimate Debian Database: Consolidating bazaar metadata for Quality Assurance and data mining

FLOSS distributions like RedHat and Ubuntu require much more complex infrastructures than most other FLOSS projects. In community-driven distributions like Debian, the development of such an infrastructure is often loosely organized: new data sources are added ad hoc as hackers set up new services that gain acceptance in the community. Mixing and matching data is then harder than it should be, even though it is badly needed for Quality Assurance and data mining. Massive refactoring and integration is not a viable solution either, due to the constraints imposed by the bazaar development model. This paper presents the Ultimate Debian Database (UDD), the countermeasure adopted by the Debian project against this “data hell”. UDD gathers data from various data sources into a single, central SQL database, turning Quality Assurance needs that could not be easily implemented before into simple SQL queries. The paper also discusses the customs that have contributed to the data hell, the lessons learnt while designing UDD, and its applications and potential for data mining on FLOSS distributions.
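To illustrate how a Quality Assurance need can reduce to a simple SQL query once data from separate sources sits in one database, here is a minimal sketch. It is not UDD's actual schema: the `packages` and `bugs` tables, their columns, the sample rows, and the function names are all hypothetical, and an in-memory SQLite database stands in for the central SQL database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables,
    each standing in for data imported from a different source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical: rows imported from the package archive
        CREATE TABLE packages (name TEXT PRIMARY KEY, maintainer TEXT);
        -- Hypothetical: rows imported from the bug tracker
        CREATE TABLE bugs (id INTEGER PRIMARY KEY, package TEXT, severity TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO packages VALUES (?, ?)",
        [("foo", "alice"), ("bar", "bob"), ("baz", "alice")],
    )
    conn.executemany(
        "INSERT INTO bugs VALUES (?, ?, ?)",
        [
            (1, "foo", "serious"),
            (2, "bar", "minor"),
            (3, "baz", "serious"),
            (4, "bar", "serious"),
        ],
    )
    return conn


def serious_bugs_per_maintainer(conn):
    """Join the two sources: count 'serious' bugs per maintainer,
    ordered by count (descending), then by maintainer name."""
    return conn.execute(
        """
        SELECT p.maintainer, COUNT(*) AS n
        FROM bugs AS b
        JOIN packages AS p ON p.name = b.package
        WHERE b.severity = 'serious'
        GROUP BY p.maintainer
        ORDER BY n DESC, p.maintainer
        """
    ).fetchall()


if __name__ == "__main__":
    for maintainer, count in serious_bugs_per_maintainer(build_demo_db()):
        print(maintainer, count)
```

Without a shared database, answering the same question would mean querying the archive and the bug tracker separately and matching package names by hand; here it is a single join.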
