Automated Localization for Unreproducible Builds

Reproducibility is the ability to recreate identical binaries under pre-defined build environments. Driven by the need for quality assurance and the benefit of better detecting attacks against build environments, the practice of reproducible builds has gained popularity in many open-source software repositories such as Debian and Bitcoin. However, identifying the causes of unreproducible builds remains a labour-intensive and time-consuming challenge, because little information is available to guide the search and because many different causes can lead to unreproducible binaries. In this paper we propose an automated framework called RepLoc to localize the problematic files for unreproducible builds. RepLoc features a query augmentation component that utilizes information extracted from the build logs, and a heuristic rule-based filtering component that narrows the search scope. By integrating the two components with a weighted file ranking module, RepLoc automatically produces a ranked list of files that helps locate the problematic files for unreproducible builds. We have implemented a prototype and conducted extensive experiments over 671 real-world unreproducible Debian packages in four different categories. When only the topmost ranked file is considered, RepLoc achieves an accuracy rate of 47.09%. When the examination is expanded to the top ten ranked files in the list produced by RepLoc, the accuracy rate rises to 79.28%. Considering that a package contains hundreds of source files, scripts, Makefiles, etc., RepLoc significantly reduces the scope of localizing problematic files. Moreover, with the help of RepLoc, we successfully identified and fixed six new unreproducible packages from Debian and Guix.
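To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how a query augmented with build-log terms, a rule-based heuristic, and a weighted ranking could fit together. It is not RepLoc's actual implementation: the function names, the TF-IDF cosine similarity, the linear weight `alpha`, the example patterns, and the top-k accuracy definition (a package counts as a hit if any problematic file appears among its top k ranked files) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split text into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_files(files, query, log_terms, patterns, alpha=0.7):
    """Rank file names by a weighted mix of text similarity and rule hits.

    files:     dict mapping file name -> file content
    query:     textual description of the unreproducible issue
    log_terms: extra terms taken from build logs (query augmentation)
    patterns:  hypothetical regexes flagging suspicious commands
    alpha:     assumed weight of the similarity score (not from the paper)
    """
    docs = {name: tokenize(body) for name, body in files.items()}

    # Smoothed inverse document frequency over the package's files.
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    idf = {t: math.log((n_docs + 1) / (c + 1)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        # Terms that occur in no file get weight 0.
        return {t: f * idf.get(t, 0.0) for t, f in Counter(tokens).items()}

    # Query augmentation: append the log-derived terms to the base query.
    q_vec = vectorize(tokenize(query) + tokenize(" ".join(log_terms)))

    scores = {}
    for name, body in files.items():
        sim = cosine(q_vec, vectorize(docs[name]))
        rule_hit = 1.0 if any(re.search(p, body) for p in patterns) else 0.0
        scores[name] = alpha * sim + (1.0 - alpha) * rule_hit

    # Highest score first; ties are broken by file name for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


def top_k_accuracy(rankings, truth, k):
    """Fraction of packages with at least one problematic file in the top k."""
    hits = sum(1 for pkg, ranked in rankings.items()
               if set(ranked[:k]) & truth[pkg])
    return hits / len(rankings)


# Toy package with made-up file contents.
files = {
    "Makefile": "build:\n\tgzip -9 out.txt\n\tdate > stamp\n",
    "src/main.c": "int main(void) { return 0; }\n",
    "docs/README": "usage notes for the tool\n",
}
ranked = rank_files(
    files,
    query="timestamp differs in gzip archive",
    log_terms=["gzip", "stamp"],
    patterns=[r"\bdate\b", r"\bgzip\b"],  # illustrative rules only
)
print(ranked)
```

In this toy package only the Makefile shares terms with the augmented query and matches a rule, so it is ranked first. A linear combination is just one simple way to merge the two signals; the weighting used by RepLoc itself is not specified in the abstract.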
