PYInfer: Deep Learning Semantic Type Inference for Python Variables

Python type inference is challenging in practice. Because of Python's dynamic features and its extensive reliance on third-party libraries that lack type annotations, traditional static analysis techniques perform poorly. Although the semantics embedded in source code can reveal a variable's intended usage (and thus help infer its type), existing tools usually ignore them. In this paper, we propose PYInfer, an end-to-end learning-based tool that automatically generates type annotations for Python variables. The key insight is that contextual code semantics is critical for inferring a variable's type. For each use of a variable, we collect a few tokens within its contextual scope and design a neural network to predict its type. One challenge is that a high-quality human-labeled training dataset is difficult to collect for this purpose. To address this issue, we apply an existing static analyzer to generate ground-truth types for variables in source code. Our main contribution is a novel approach that statically infers variable types both effectively and efficiently. By formulating type inference as a classification problem, we can handle user-defined types and predict a type probability distribution for each variable. Our model achieves 91.2% accuracy in classifying 11 basic Python types and 81.2% accuracy in classifying the 500 most common types, substantially outperforming state-of-the-art type annotators. Moreover, PYInfer achieves 5.2X higher code coverage and is 187X faster than a state-of-the-art learning-based tool. With similar time consumption, our model annotates 5X more variables than a state-of-the-art static analysis tool (PySonar2). It also outperforms TypeWriter, a learning-based function-level annotator, at annotating types for variables and function arguments. All our tools and datasets are publicly available to facilitate future research in this direction.
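The abstract describes collecting a few tokens within a variable's contextual scope as the input to the type classifier. As a rough illustration only (the function name, window size, and token filtering below are our own assumptions, not PYInfer's actual pipeline), context collection for a variable can be sketched with Python's standard `tokenize` module:

```python
import io
import tokenize

def context_tokens(source: str, var_name: str, window: int = 4):
    """Toy stand-in for contextual-scope token collection: for every
    occurrence of `var_name`, gather up to `window` tokens on each side.
    (Hypothetical sketch; PYInfer's real extraction may differ.)"""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    toks = [t.string
            for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.type not in skip]
    contexts = []
    for i, tok in enumerate(toks):
        if tok == var_name:
            # Window of surrounding tokens, clipped at the file boundaries.
            contexts.append(toks[max(0, i - window): i + window + 1])
    return contexts

src = "count = len(items)\ncount += 1\n"
print(context_tokens(src, "count"))
# → [['count', '=', 'len', '(', 'items'], ['len', '(', 'items', ')', 'count', '+=', '1']]
```

Each such token window would then be embedded and fed to the neural classifier, which outputs a probability over the candidate types (e.g. the 11 basic types or the 500 most common types mentioned above).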
