Style-Analyzer: Fixing Code Style Inconsistencies with Interpretable Unsupervised Algorithms

Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. Because they are based on rules written by domain experts, their configuration is often tedious, and it is impractical for any given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes. This paper introduces style-analyzer, a new open-source tool that automatically fixes code formatting violations using a decision tree forest model that adapts to each codebase and is fully unsupervised. style-analyzer is built on top of our novel assisted code review framework, Lookout. It accurately mines the formatting style of each analyzed Git repository and expresses the discovered format patterns as compact, human-readable rules. style-analyzer can then suggest fixes for style inconsistencies in the form of code review comments. We evaluate the output quality and practical relevance of style-analyzer by demonstrating that it can reproduce the original style with high precision, measured on 19 popular JavaScript projects, and by showing that it yields promising results in fixing real style mistakes. style-analyzer includes a web application to visualize how the rules are triggered. We release style-analyzer as a reusable and extensible open-source software package on GitHub for the benefit of the community.
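To make the idea of mining formatting rules from a codebase concrete, the following is a minimal sketch. It is not the style-analyzer implementation and does not use a decision tree forest: it swaps in a much simpler majority-vote rule miner over pairs of adjacent tokens. The tokenizer, the function names (`mine_rules`, `check`), and the `min_support` and `min_confidence` thresholds are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Toy tokenizer (an assumption): identifiers, integers, or any single symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def kind(tok):
    """Map a token to a coarse class so that rules generalize across names."""
    if tok[0].isalpha() or tok[0] == "_":
        return "name"
    if tok[0].isdigit():
        return "num"
    return tok  # punctuation and operators stand for themselves


def gaps(line):
    """Yield (left_kind, right_kind, whitespace_between, column) per token pair."""
    matches = list(TOKEN.finditer(line))
    for a, b in zip(matches, matches[1:]):
        yield kind(a.group()), kind(b.group()), line[a.end():b.start()], a.end()


def mine_rules(lines, min_support=3, min_confidence=0.8):
    """Learn, without labels, the dominant whitespace for each token-pair context."""
    counts = defaultdict(Counter)
    for line in lines:
        for left, right, gap, _ in gaps(line):
            counts[(left, right)][gap] += 1
    rules = {}
    for ctx, counter in counts.items():
        gap, n = counter.most_common(1)[0]
        total = sum(counter.values())
        # Keep only rules that are both frequent and consistent in this codebase.
        if total >= min_support and n / total >= min_confidence:
            rules[ctx] = gap
    return rules


def check(line, rules):
    """Return (column, expected_gap, actual_gap) for every violated rule."""
    return [
        (col, rules[(left, right)], gap)
        for left, right, gap, col in gaps(line)
        if (left, right) in rules and rules[(left, right)] != gap
    ]


# Hypothetical four-line "repository" used only to exercise the sketch.
corpus = ["var a = 1;", "var b = 2;", "var c = a + b;", "var d = c;"]
rules = mine_rules(corpus)
for ctx, gap in sorted(rules.items()):
    print(f"between {ctx[0]!r} and {ctx[1]!r}: expect {gap!r}")
print(check("var e= 5;", rules))
```

Each learned rule reads as a plain statement ("between a name and `=`, expect one space"), which is the kind of interpretability the abstract contrasts with black-box models. A tree-based model can condition on richer context than a single token pair, which this sketch does not attempt.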
