Automatic Detection and Analysis of Technical Debts in Peer-Review Documentation of R Packages

Technical debt (TD) is a metaphor for code-related problems that arise from prioritizing speedy delivery over perfect code. Because reducing TD can have a long-term positive impact across the software development life cycle (SDLC), TD has been studied extensively in the literature. However, very little existing research has focused on the technical debt of the R programming language, despite its popularity and usage. Recent research by Codabux et al. [21] finds, by analyzing peer-review documentation, that R packages can have 10 diverse TD types. However, these findings are based on manual analysis of a small sample of R package review comments. In this paper, we develop a suite of Machine Learning (ML) classifiers to detect the 10 TD types automatically. The best-performing classifier is based on the deep ML model BERT, which achieves F1-scores of 0.71 to 0.91. We then apply the trained BERT models to all available peer-review issue comments from two platforms, rOpenSci and BioConductor (13.5K review comments from a total of 1297 R packages). We conduct an empirical study of the prevalence and evolution of the 10 TD types on the two R platforms. We find that documentation debt is the most prevalent of all TD types and that it is also expanding rapidly. We also find that R packages on the generic platform (i.e., rOpenSci) are more prone to TD than those on the domain-specific platform (i.e., BioConductor). Our empirical findings can guide future improvement opportunities in R package documentation. Our ML models can be used to automatically monitor the prevalence and evolution of TD in R package documentation.

[1]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[2]  Junaed Younus Khan,et al.  Automatic Detection of Five API Documentation Smells: Practitioners’ Perspectives , 2021, 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).

[3]  Foutse Khomh,et al.  Automatic API Usage Scenario Documentation from Technical Q&A Sites , 2021, ACM Trans. Softw. Eng. Methodol..

[4]  Lutz Prechelt,et al.  Automatic early stopping using cross validation: quantifying the criteria , 1998, Neural Networks.

[5]  Hierarchical Classification , 2019, CIRP Encyclopedia of Production Engineering.

[6]  Chu-Ren Huang,et al.  Lexical Data Augmentation for Text Classification in Deep Learning , 2020, Canadian Conference on AI.

[7]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[8]  Jan Bosch,et al.  Investigating Architectural Technical Debt accumulation and refactoring over time: A multiple-case study , 2015, Inf. Softw. Technol..

[9]  Susan T. Dumais,et al.  Using SVMs for Text Categorization , 2016 .

[10]  Rami Bahsoon,et al.  Database Design Debts through Examining Schema Evolution , 2016, 2016 IEEE 8th International Workshop on Managing Technical Debt (MTD).

[11]  Patrick Debois,et al.  Agile Infrastructure and Operations: How Infra-gile are You? , 2008, Agile 2008 Conference.

[12]  Nikolaos Tsantalis,et al.  Using Natural Language Processing to Automatically Detect Self-Admitted Technical Debt , 2017, IEEE Transactions on Software Engineering.

[13]  Hideaki Hata,et al.  Identifying Design and Requirement Self-Admitted Technical Debt Using N-gram IDF , 2018, 2018 9th International Workshop on Empirical Software Engineering in Practice (IWESEP).

[14]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[15]  Radu Marinescu,et al.  Assessing technical debt by identifying design flaws in software systems , 2012, IBM J. Res. Dev..

[16]  Lorenzo Rosasco,et al.  Are Loss Functions All the Same? , 2004, Neural Computation.

[17]  Rok Blagus,et al.  SMOTE for high-dimensional class-imbalanced data , 2013, BMC Bioinformatics.

[18]  Zadia Codabux,et al.  An empirical assessment of technical debt practices in industry , 2017, J. Softw. Evol. Process..

[19]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[20]  Lutz Prechelt,et al.  Early Stopping - But When? , 2012, Neural Networks: Tricks of the Trade.

[21]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[22]  Emad Shihab,et al.  An Exploratory Study on Self-Admitted Technical Debt , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[23]  César Ferri,et al.  Improving Performance of Multiclass Classification by Inducing Class Hierarchies , 2017, ICCS.

[24]  Alex A. Freitas,et al.  A survey of hierarchical classification across different application domains , 2010, Data Mining and Knowledge Discovery.

[25]  Alexander Chatzigeorgiou,et al.  Technical debt forecasting: An empirical study on open-source repositories , 2020, J. Syst. Softw..

[26]  Robert L. Nord,et al.  Technical Debt: From Metaphor to Theory and Practice , 2012, IEEE Software.

[27]  Xijin Tang,et al.  Text classification based on multi-word with support vector machine , 2008, Knowl. Based Syst..

[28]  Yuanfang Cai,et al.  Comparing four approaches for technical debt identification , 2014, Software Quality Journal.

[29]  Jan Bosch,et al.  Technical Debt Cripples Software Developer Productivity: A Longitudinal Study on Developers’ Daily Software Development Work , 2018, 2018 IEEE/ACM International Conference on Technical Debt (TechDebt).

[30]  Peng Liang,et al.  A systematic mapping study on technical debt and its management , 2015, J. Syst. Softw..

[31]  Carolyn B. Seaman,et al.  Measuring and Monitoring Technical Debt , 2011, Adv. Comput..

[32]  Philippe Kruchten,et al.  What is social debt in software engineering? , 2013, 2013 6th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE).

[33]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[34]  Gias Uddin,et al.  Mining API Aspects in API Reviews , 2017 .

[35]  Eduardo C. Garrido-Merchán,et al.  Comparing BERT against traditional machine learning text classification , 2020, ArXiv.

[36]  Apostolos Ampatzoglou,et al.  Experience With Managing Technical Debt in Scientific Software Development Using the EXA2PRO Framework , 2021, IEEE Access.

[37]  Martin P. Robillard,et al.  How API Documentation Fails , 2015, IEEE Software.

[38]  Anindya Iqbal,et al.  How do developers discuss and support new programming languages in technical Q&A site? An empirical study of Go, Swift, and Rust in Stack Overflow , 2021, Inf. Softw. Technol..

[39]  Dipanjan Das,et al.  BERT Rediscovers the Classical NLP Pipeline , 2019, ACL.

[40]  Xuanjing Huang,et al.  How to Fine-Tune BERT for Text Classification? , 2019, CCL.

[41]  Alexander Serebrenik,et al.  An Empirical Study on the Removal of Self-Admitted Technical Debt , 2017, 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[42]  Zadia Codabux,et al.  Technical Debt in the Peer-Review Documentation of R Packages: a rOpenSci Case Study , 2021, 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR).

[43]  Robert L. Nord,et al.  Technical debt: towards a crisper definition report on the 4th international workshop on managing technical debt , 2013, SOEN.

[44]  David Lo,et al.  Automating Change-Level Self-Admitted Technical Debt Determination , 2019, IEEE Transactions on Software Engineering.

[45]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[46]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[47]  Forrest Shull,et al.  A case study on effectively identifying technical debt , 2013, EASE '13.

[48]  Eric Allman,et al.  Managing Technical Debt , 2012, ACM Queue.

[49]  Mary Popeck,et al.  Got Technical Debt? Surfacing Elusive Technical Debt in Issue Trackers , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[50]  Ward Cunningham,et al.  The WyCash portfolio management system , 1992, OOPSLA '92.

[51]  Manoel G. Mendonça,et al.  A tertiary study on technical debt: Types, management strategies, research trends, and base information for practitioners , 2018, Inf. Softw. Technol..

[52]  Foutse Khomh,et al.  Understanding How and Why Developers Seek and Analyze API-Related Opinions , 2019, IEEE Transactions on Software Engineering.

[53]  Frank Buschmann,et al.  To Pay or Not to Pay Technical Debt , 2011, IEEE Software.

[54]  Pavel Brazdil,et al.  Comparison of SVM and Some Older Classification Algorithms in Text Classification Tasks , 2006, IFIP AI.

[55]  Jean YH Yang,et al.  Bioconductor: open software development for computational biology and bioinformatics , 2004, Genome Biology.

[56]  Tim Menzies,et al.  Identifying Self-Admitted Technical Debts With Jitterbug: A Two-Step Approach , 2020, IEEE Transactions on Software Engineering.

[57]  Philippe Kruchten,et al.  Architectural Technical Debt: A Grounded Theory , 2020, ECSA.

[58]  Peng Liang,et al.  Architectural Technical Debt Identification Based on Architecture Decisions and Change Scenarios , 2015, 2015 12th Working IEEE/IFIP Conference on Software Architecture.

[59]  Leevi Rantala,et al.  Towards Better Technical Debt Detection with NLP and Machine Learning Methods , 2020, 2020 IEEE/ACM 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion).

[60]  Richard T. Vidgen,et al.  An exploration of technical debt , 2013, J. Syst. Softw..

[61]  Lidong Bing,et al.  Exploiting BERT for End-to-End Aspect-based Sentiment Analysis , 2019, EMNLP.

[62]  Foutse Khomh,et al.  Automatic summarization of API reviews , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[63]  Marco Tulio Valente,et al.  Beyond the Code: Mining Self-Admitted Technical Debt in Issue Tracker Systems , 2020, 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR).

[64]  Yuanfang Cai,et al.  Identifying and Quantifying Architectural Debt , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[65]  Yasutaka Kamei,et al.  A survey of self-admitted technical debt , 2019, J. Syst. Softw..

[66]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[67]  Xuemin Wang,et al.  A Survey of Text Data Augmentation , 2020, 2020 International Conference on Computer Communication and Network Security (CCNS).

[68]  Forrest Shull,et al.  Investigating the impact of design debt on software quality , 2011, MTD '11.

[69]  Robert L. Nord,et al.  Reducing Friction in Software Development , 2016, IEEE Software.

[70]  Yi Sun,et al.  Some Code Smells Have a Significant but Small Effect on Faults , 2014, TSEM.

[71]  Kazi Zakia Sultana,et al.  Examining the Relationship of Code and Architectural Smells with Software Vulnerabilities , 2020, 2020 27th Asia-Pacific Software Engineering Conference (APSEC).

[72]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[73]  Apostolos Ampatzoglou,et al.  The financial aspect of managing technical debt: A systematic literature review , 2015, Inf. Softw. Technol..

[74]  Richard T. Vidgen,et al.  A Consolidated Understanding of Technical debt , 2012, ECIS.

[75]  Markku Oivo,et al.  Analyzing the concept of technical debt in the context of agile software development: A systematic literature review , 2017, Inf. Softw. Technol..

[76]  Hernán Astudillo,et al.  Hearing the Voice of Software Practitioners on Causes, Effects, and Practices to Deal with Documentation Debt , 2020, REFSQ.

[77]  Carolyn B. Seaman,et al.  A Balancing Act: What Software Practitioners Have to Say about Technical Debt , 2012, IEEE Softw..

[78]  C. Spearman.  The proof and measurement of association between two things , 2015, International journal of epidemiology.

[79]  Jennifer Pérez,et al.  Guiding Flexibility Investment in Agile Architecting , 2014, 2014 47th Hawaii International Conference on System Sciences.

[80]  Ipek Ozkaya,et al.  Managing Technical Debt in Software Engineering (Dagstuhl Seminar 16162) , 2016, Dagstuhl Reports.

[81]  Philippe Kruchten,et al.  Building and evaluating a theory of architectural technical debt in software-intensive systems , 2021, J. Syst. Softw..

[82]  Forrest Shull,et al.  Identification and management of technical debt: A systematic mapping study , 2016, Inf. Softw. Technol..

[83]  Kelly Blincoe,et al.  Embracing Technical Debt, from a Startup Company Perspective , 2018, 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[84]  Elaine Venson,et al.  A Systematic Literature Review of Technical Debt Prioritization , 2020, 2020 IEEE/ACM International Conference on Technical Debt (TechDebt).

[85]  Hakan Erdogmus.  Comparative evaluation of software development strategies based on Net Present Value , 1999 .

[86]  David Lo,et al.  SATD Detector: A Text-Mining-Based Self-Admitted Technical Debt Detection Tool , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion).

[87]  Scott Chamberlain,et al.  Building Software, Building Community: Lessons from the rOpenSci Project , 2014 .