Crowdsourcing Identification of License Violations

Free and open source software (FOSS) has created a large pool of source code that can easily be copied to create new applications. However, a copy should preserve the copyright notice and license of the original file unless the license explicitly permits changing them. As software evolves, it is challenging to keep the original licenses or to choose proper ones, and as a result there are many potential license violations. Although violations can seriously undermine copyright protection, identifying them is highly complex and relies on manual inspection by experts. Such inspection does not scale to the volume of open source software released daily worldwide. To make this process scalable, we propose two methods: using machine-based algorithms to narrow down the potential violations, and guiding non-experts to manually inspect violations. Using the first method, we found 219 projects (76.6%) with potential violations. Using the second method, we show that the accuracy of crowds is comparable to that of experts. Our techniques might help developers identify potential violations, understand their causes, and resolve them.
