Learning programs from noisy data

We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator, which produces a candidate program from a small sample of the dataset while avoiding overfitting, and a dataset sampler, which carefully samples the dataset by leveraging the candidate program's score on it. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach in two settings: one where the dataset has a bound on the noise, and one without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypothesis classes formed by a discrete search space. We then present two new kinds of program synthesizers that target these two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting, a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open-source programs, also known as "Big Code". The key idea is to introduce a domain-specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This technique is flexible and powerful: it generalizes several existing works, since we no longer need to decide a priori what the prediction should be conditioned on, and the learned functions provide a natural mechanism for explaining a prediction. As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems.
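To make the feedback-directed loop concrete, here is a minimal sketch in Python. The names `synthesize` and `satisfies`, the sample size, and the re-sampling policy are all our illustrative assumptions, not the paper's actual interface: the generator synthesizes a regularized candidate from a small sample, and the sampler biases the next sample toward examples the current candidate scores poorly on.

```python
# Hedged sketch of the generator/sampler feedback loop.
# `synthesize` and `candidate.satisfies` are hypothetical stand-ins for the
# regularized program generator and a per-example correctness check.

def learn_from_noisy_data(dataset, synthesize, sample_size=16, max_iters=50):
    """Alternate between synthesizing on a small sample and re-sampling
    the dataset where the current candidate scores worst."""
    sample = list(dataset[:sample_size])          # initial sample
    best, best_errors = None, len(dataset) + 1
    for _ in range(max_iters):
        # Generator: produce a candidate from the sample; assumed to be
        # regularized (e.g., by program size) to avoid overfitting.
        candidate = synthesize(sample)
        # Score the candidate on the full dataset.
        wrong = [x for x in dataset if not candidate.satisfies(x)]
        if len(wrong) < best_errors:
            best, best_errors = candidate, len(wrong)
        if not wrong:                             # perfect on all examples
            break
        # Sampler: bias the next sample toward examples the candidate
        # currently gets wrong, keeping part of the old sample.
        sample = (wrong + sample)[:sample_size]
    return best
```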
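The setting without a noise bound can be read as approximate regularized empirical risk minimization over a discrete hypothesis class. As a hedged restatement in standard notation (the symbols below are ours, not the paper's):

```latex
% Regularized ERM over a discrete search space \mathcal{H}:
% r(p, \mathcal{D}) is the empirical risk of program p on dataset \mathcal{D}
% (e.g., the fraction of examples p violates), \Omega(p) is a regularizer
% such as program size, and \lambda trades the two off.
p^{\ast} \in \operatorname*{arg\,min}_{p \in \mathcal{H}}
    \left[ \, r(p, \mathcal{D}) + \lambda \cdot \Omega(p) \, \right]
```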
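For the code completion setting, the key mechanism is a DSL whose programs walk the syntax tree and decide what context a prediction is conditioned on. The sketch below is a hypothetical miniature of such a DSL; the instruction names, the `Node` shape, and `run_tcond` are our assumptions for illustration, not the paper's definitions. A learned function is a short sequence of tree moves and writes, and the accumulated context keys an empirical conditional distribution estimated from the "Big Code" corpus.

```python
# Hypothetical miniature of a DSL over trees: a "conditioning program" is a
# sequence of moves over an AST node plus writes that accumulate the context
# a prediction is conditioned on. All names here are illustrative.

class Node:
    def __init__(self, value, parent=None, children=None):
        self.value = value
        self.parent = parent
        self.children = children or []

def run_tcond(program, node):
    """Execute a conditioning program; return the accumulated context."""
    context = []
    for op in program:
        if op == "UP" and node.parent:
            node = node.parent                    # move to the parent node
        elif op == "LEFT" and node.parent:
            siblings = node.parent.children       # move to the left sibling
            i = siblings.index(node)
            node = siblings[max(i - 1, 0)]
        elif op == "WRITE_VALUE":
            context.append(node.value)            # record the node's value
    return tuple(context)

# The resulting context keys a conditional probability table learned from
# the corpus: P(completion | run_tcond(learned_program, node)).
```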
