Learning programs from noisy data

We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator, which produces a candidate program from a small sample of the dataset while avoiding overfitting, and a dataset sampler, which carefully samples the dataset by leveraging the candidate program's score on it. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach in two settings: one where the dataset has a bound on the noise, and one without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypothesis classes formed by a discrete search space. We then present two new kinds of program synthesizers that target these two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting, a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open-source programs, also known as "Big Code". The key idea is to introduce a domain-specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This technique is flexible and powerful: it generalizes several existing works, since we no longer need to decide a priori what the prediction should be conditioned on, and the learned functions provide a natural mechanism for explaining a prediction. As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems.
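To make the feedback-directed loop concrete, here is a minimal sketch in Python. The names `synthesize` and `satisfies`, the sample size, and the re-sampling policy are all our illustrative assumptions, not the paper's actual interface: the generator synthesizes a regularized candidate from a small sample, and the sampler biases the next sample toward examples the current candidate scores poorly on.

```python
# Hedged sketch of the generator/sampler feedback loop.
# `synthesize` and `candidate.satisfies` are hypothetical stand-ins for the
# regularized program generator and a per-example correctness check.

def learn_from_noisy_data(dataset, synthesize, sample_size=16, max_iters=50):
    """Alternate between synthesizing on a small sample and re-sampling
    the dataset where the current candidate scores worst."""
    sample = list(dataset[:sample_size])          # initial sample
    best, best_errors = None, len(dataset) + 1
    for _ in range(max_iters):
        # Generator: produce a candidate from the sample; assumed to be
        # regularized (e.g., by program size) to avoid overfitting.
        candidate = synthesize(sample)
        # Score the candidate on the full dataset.
        wrong = [x for x in dataset if not candidate.satisfies(x)]
        if len(wrong) < best_errors:
            best, best_errors = candidate, len(wrong)
        if not wrong:                             # perfect on all examples
            break
        # Sampler: bias the next sample toward examples the candidate
        # currently gets wrong, keeping part of the old sample.
        sample = (wrong + sample)[:sample_size]
    return best
```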
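The setting without a noise bound can be read as approximate regularized empirical risk minimization over a discrete hypothesis class. As a hedged restatement in standard notation (the symbols below are ours, not the paper's):

```latex
% Regularized ERM over a discrete search space \mathcal{H}:
% r(p, \mathcal{D}) is the empirical risk of program p on dataset \mathcal{D}
% (e.g., the fraction of examples p violates), \Omega(p) is a regularizer
% such as program size, and \lambda trades the two off.
p^{\ast} \in \operatorname*{arg\,min}_{p \in \mathcal{H}}
    \left[ \, r(p, \mathcal{D}) + \lambda \cdot \Omega(p) \, \right]
```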
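For the code completion setting, the key mechanism is a DSL whose programs walk the syntax tree and decide what context a prediction is conditioned on. The sketch below is a hypothetical miniature of such a DSL; the instruction names, the `Node` shape, and `run_tcond` are our assumptions for illustration, not the paper's definitions. A learned function is a short sequence of tree moves and writes, and the accumulated context keys an empirical conditional distribution estimated from the "Big Code" corpus.

```python
# Hypothetical miniature of a DSL over trees: a "conditioning program" is a
# sequence of moves over an AST node plus writes that accumulate the context
# a prediction is conditioned on. All names here are illustrative.

class Node:
    def __init__(self, value, parent=None, children=None):
        self.value = value
        self.parent = parent
        self.children = children or []

def run_tcond(program, node):
    """Execute a conditioning program; return the accumulated context."""
    context = []
    for op in program:
        if op == "UP" and node.parent:
            node = node.parent                    # move to the parent node
        elif op == "LEFT" and node.parent:
            siblings = node.parent.children       # move to the left sibling
            i = siblings.index(node)
            node = siblings[max(i - 1, 0)]
        elif op == "WRITE_VALUE":
            context.append(node.value)            # record the node's value
    return tuple(context)

# The resulting context keys a conditional probability table learned from
# the corpus: P(completion | run_tcond(learned_program, node)).
```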
