FlashProfile: a framework for synthesizing data profiles

We address the problem of learning a syntactic profile for a collection of strings, i.e., a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g., digits, letters, words) and provide no control over the granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only about 0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e., synthesis using fewer examples, in programming-by-example (PBE) workflows such as Flash Fill.
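To make the notion of a syntactic profile concrete, the following is a minimal illustrative sketch in Python. It is not FlashProfile's algorithm and does not use PROSE: it groups strings by a fixed, coarse token signature, which is closer to the pre-defined-pattern baselines the abstract contrasts against. The token classes and the names `TOKEN_CLASSES`, `signature`, and `profile` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical coarse token classes (an assumption for this sketch,
# not FlashProfile's pattern language).
TOKEN_CLASSES = [
    ("Digit+", re.compile(r"[0-9]+")),
    ("Upper+", re.compile(r"[A-Z]+")),
    ("Lower+", re.compile(r"[a-z]+")),
    ("Space+", re.compile(r"\s+")),
]

def signature(s):
    """Describe s as a tuple of token-class names and quoted literals."""
    tokens = []
    i = 0
    while i < len(s):
        for name, rx in TOKEN_CLASSES:
            m = rx.match(s, i)
            if m:
                tokens.append(name)
                i = m.end()
                break
        else:
            # No class matched: keep the character as a quoted literal.
            tokens.append(repr(s[i]))
            i += 1
    return tuple(tokens)

def profile(strings):
    """Cluster strings that share a signature; each key acts as a pattern."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[signature(s)].append(s)
    return dict(clusters)

data = ["1998-05-12", "2001-11-03", "May 2003", "June 1999", "N/A"]
for pattern, members in profile(data).items():
    print(" ".join(pattern), "<-", members)
```

In this sketch the number of clusters is fixed by the token classes, so it offers no control over granularity. The technique described in the abstract instead clusters by syntactic similarity and lets the user request a desired number of clusters.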
