On the Impact of Tokenizer and Parameters on N-Gram Based Code Analysis

Recent research shows that language models, such as n-gram models, are useful for a wide variety of software engineering tasks, e.g., code completion, bug identification and code summarisation. However, such models require the appropriate setting of numerous parameters. Moreover, the different ways one can read code yield different models, since they produce different sequences of tokens. In this paper, we focus on n-gram models and evaluate how the choice of tokenizer, smoothing technique, unknown threshold and n value impacts the predictive ability of these models. To this end, we compare multiple tokenizers and sets of parameters (smoothing techniques, unknown thresholds and n values) with the aim of identifying the most appropriate combinations. Our results show that the Modified Kneser-Ney smoothing technique performs best, while the best n value depends on the choice of tokenizer, with values of 4 or 5 offering a good trade-off between entropy and computation time. Interestingly, we find that tokenizers treating the code as simple text are the most robust ones. Finally, we demonstrate that the differences between tokenizers are of practical importance and have the potential to change the conclusions of a given experiment.
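
To make the interplay between tokenizer, smoothing and n concrete, the sketch below trains small n-gram models over the same code snippet tokenized in two ways: a "plain text" regex tokenizer and a lexer-based tokenizer built on Python's tokenize module. It is a minimal illustration, not the paper's experimental pipeline: the snippet, the tokenizer functions and the parameter values are made up for the example, and NLTK's interpolated Kneser-Ney model stands in for the Modified Kneser-Ney smoothing evaluated in the paper.

```python
# Minimal sketch: compare a "code as text" tokenizer with a lexer-based one
# and measure n-gram entropy under Kneser-Ney-style smoothing.
import io
import re
import tokenize

from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import ngrams

# Toy snippet, for illustration only; a real study trains on a code corpus.
SRC = 'def inc(x):\n    x += 1\n    return "x=" + str(x)\n'


def text_tokenizer(src):
    """Treat code as plain text: words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", src)


def lexer_tokenizer(src):
    """Use Python's own lexer, so tokens follow the language grammar."""
    return [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(src).readline)
            if tok.string.strip()]


def entropy(tokens, order):
    """Fit an order-n interpolated Kneser-Ney model and score the sequence.

    Scoring the training sequence keeps the example self-contained; a real
    experiment would score held-out code instead.
    """
    train, vocab = padded_everygram_pipeline(order, [tokens])
    lm = KneserNeyInterpolated(order)
    lm.fit(train, vocab)
    test = list(ngrams(pad_both_ends(tokens, n=order), order))
    return lm.entropy(test)


for name, tok in [("text", text_tokenizer), ("lexer", lexer_tokenizer)]:
    for order in (3, 4, 5):
        toks = tok(SRC)
        print(f"{name:5s} n={order}  tokens={len(toks):2d}  "
              f"entropy={entropy(toks, order):.3f}")
```

Because the two tokenizers emit different token sequences (e.g., the lexer keeps += and string literals as single tokens, while the text tokenizer splits them), the resulting models, and hence their entropy values, differ; this is the effect the paper quantifies at scale across tokenizers, smoothing techniques, unknown thresholds and n values.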
