Quality Evaluation of Modern Code Reviews Through Intelligent Biometric Program Comprehension

Code review is an essential practice in software engineering for spotting code defects in the early stages of software development. Modern code reviews (e.g., the acceptance or rejection of pull requests in Git) have become more lightweight and less formal than classic Fagan inspections, and more reliant on individual reviewers. However, reviewers may face mentally demanding challenges during a review, such as code comprehension difficulties or distractions, that can degrade review quality. This work proposes a novel approach that evaluates the quality of code reviews in terms of bug-finding effectiveness and gives the reviewer a clear indication of whether the review should be repeated, pointing out the code regions that may not have been well reviewed. The proposed approach relies on biometric information collected from the reviewer during the review process using non-intrusive biofeedback devices, such as smartwatches and inexpensive desktop eye-trackers compatible with typical software development settings. Biometric measures such as Heart Rate Variability (HRV) and the task-evoked pupillary response are captured as surrogates of the reviewer's cognitive state (e.g., mental workload). Artificial Intelligence techniques are then used to predict cognitive load from the extracted biomarkers and to classify each code region according to a set of features. The final evaluation considers factors such as code complexity, the time spent on the review, and the reviewer's experience level. Our experimental results show that the approach can predict review quality with an accuracy of 87.77% ± 4.65 and a Spearman correlation coefficient of 0.85 (p-value < 0.001) between the predicted and actual review performance. The cognitive load measurements were validated using electroencephalography (EEG) signals as ground truth for the HRV and pupillary signals.
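
The following is a minimal sketch, not the authors' implementation, of the kind of pipeline the abstract describes: extracting HRV and pupillary features per reviewed code region, combining them with contextual factors (code complexity, review time, reviewer experience), predicting review quality with a classifier, and reporting accuracy and Spearman's correlation between predicted and actual performance. All feature names, the random-forest classifier, and the labeling scheme are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def hrv_time_domain_features(rr_intervals_ms: np.ndarray) -> dict:
    """Standard time-domain HRV features from a series of RR intervals (in ms)."""
    diffs = np.diff(rr_intervals_ms)
    return {
        "mean_rr": float(np.mean(rr_intervals_ms)),
        "sdnn": float(np.std(rr_intervals_ms, ddof=1)),   # overall variability
        "rmssd": float(np.sqrt(np.mean(diffs ** 2))),     # short-term variability
        "pnn50": float(np.mean(np.abs(diffs) > 50.0)),    # fraction of large beat-to-beat changes
    }


def region_feature_vector(rr_intervals_ms, pupil_diameter_mm, code_complexity,
                          review_seconds, reviewer_experience_years) -> np.ndarray:
    """Combine biometric and contextual features for one reviewed code region."""
    hrv = hrv_time_domain_features(np.asarray(rr_intervals_ms, dtype=float))
    return np.array([
        hrv["mean_rr"], hrv["sdnn"], hrv["rmssd"], hrv["pnn50"],
        float(np.mean(pupil_diameter_mm)),   # crude proxy for task-evoked pupillary response
        float(code_complexity),              # e.g., cyclomatic complexity of the region
        float(review_seconds),               # time spent reviewing the region
        float(reviewer_experience_years),
    ])


def evaluate(X: np.ndarray, y: np.ndarray, actual_performance: np.ndarray):
    """X: one feature vector per code region; y: 1 if the region was well reviewed
    (e.g., the seeded bug in it was found), 0 otherwise; actual_performance: the
    measured review performance used to compute Spearman's rho."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    acc_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    clf.fit(X, y)
    predicted_quality = clf.predict_proba(X)[:, 1]       # predicted review-quality score
    rho, p_value = spearmanr(predicted_quality, actual_performance)
    return acc_scores.mean(), acc_scores.std(), rho, p_value
```

In practice the features would be computed over the eye-tracking fixations that fall within each code region, and the accuracy would be estimated on held-out reviewers; this sketch only illustrates how the biometric and contextual signals could be fused into a single prediction step.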
