Foofah: Transforming Data By Example

Data transformation is a critical first step in modern data analysis: before any analysis can be done, data from a variety of sources must be wrangled into a uniform format that is amenable to the intended analysis and analytical software package. This data transformation task is tedious, time-consuming, and often requires programming skills beyond the expertise of data analysts. In this paper, we develop a technique to synthesize data transformation programs by example, reducing this burden by allowing the analyst to describe the transformation with a small input-output example pair, without being concerned with the transformation steps required to get there. We implemented our technique in a system, FOOFAH, that efficiently searches the space of possible data transformation operations to generate a program that will perform the desired transformation. We experimentally show that data transformation programs can be created quickly with FOOFAH for a wide variety of cases, with 60% less user effort than the well-known WRANGLER system.
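The idea of searching a space of transformation operations for a program that maps an example input to an example output can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three operators (`transpose`, `drop_empty_rows`, `split_first_col`), the `synthesize` and `run_program` helpers, and the use of plain breadth-first search. It does not reproduce FOOFAH's actual operator set or search strategy.

```python
from collections import deque

# Tables are tuples of tuples of strings so they can be hashed and compared.
def to_table(rows):
    return tuple(tuple(row) for row in rows)

# --- Hypothetical operators (illustrative only, not FOOFAH's operator set) ---
def transpose(table):
    return tuple(zip(*table)) if table else table

def drop_empty_rows(table):
    return tuple(row for row in table if any(cell != "" for cell in row))

def split_first_col(table):
    # Split the first cell of every row on ":" into two cells.
    out = []
    for row in table:
        if not row:
            out.append(row)
            continue
        head, _, tail = row[0].partition(":")
        out.append((head, tail) + row[1:])
    return tuple(out)

OPERATORS = {
    "transpose": transpose,
    "drop_empty_rows": drop_empty_rows,
    "split_first_col": split_first_col,
}

def synthesize(example_in, example_out, max_depth=4):
    """Breadth-first search for an operator sequence mapping input to output."""
    start, goal = to_table(example_in), to_table(example_out)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, program = queue.popleft()
        if table == goal:
            return program
        if len(program) >= max_depth:
            continue
        for name, op in OPERATORS.items():
            nxt = op(table)
            if nxt not in seen:  # skip table states already reached
                seen.add(nxt)
                queue.append((nxt, program + [name]))
    return None  # no program found within max_depth

def run_program(program, rows):
    """Apply a synthesized program to a (possibly larger) table."""
    table = to_table(rows)
    for name in program:
        table = OPERATORS[name](table)
    return table

# A small input-output example pair supplied by the analyst.
raw = [["Name:Alice", "30"], ["", ""], ["Name:Bob", "25"]]
wanted = [["Name", "Alice", "30"], ["Name", "Bob", "25"]]
program = synthesize(raw, wanted)
print(program)
```

The analyst supplies only `raw` and `wanted`; the returned list of operator names is the synthesized program, which `run_program` can then apply to the full dataset. Breadth-first search returns a shortest program but grows quickly with the number of operators and the depth bound, which is why an efficient search over the operator space matters in practice.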
