Learning Semantic String Transformations from Examples

We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or combinations of them), such as a column in a relational table, a time, a date, or a currency amount. Syntactic transformations are based on regular expressions and interpret a string as a sequence of characters. Semantic transformations additionally require exploiting the semantics of the data type that the string represents, which may be encoded as a database of relational tables. Performing such transformations manually on a large collection of strings is error-prone and cumbersome, while programmatic solutions are beyond the skill set of end users. We present a programming-by-example technology that allows end users to automate such repetitive tasks. We describe an expressive transformation language for semantic manipulation that combines table lookup operations and syntactic manipulations. We then present a synthesis algorithm that can learn all transformations in the language that are consistent with the user-provided set of input-output examples. We have implemented this technology as an add-in for the Microsoft Excel spreadsheet system and have evaluated it successfully on several benchmarks picked from various Excel help forums.
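To make the idea of learning lookup-based transformations from examples concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a toy database (the `Items` and `Prices` tables and their contents are hypothetical), restricts programs to chains of exact-match table lookups with no syntactic manipulation, and finds consistent programs by brute-force enumeration rather than with the synthesis algorithm the abstract refers to. All function names are illustrative.

```python
from itertools import product

# Hypothetical toy database: table name -> list of rows (dicts).
TABLES = {
    "Items": [
        {"Id": "S30", "Name": "Stroller"},
        {"Id": "B56", "Name": "Bib"},
        {"Id": "D32", "Name": "Diapers"},
    ],
    "Prices": [
        {"Name": "Stroller", "Price": "$145.67"},
        {"Name": "Bib", "Price": "$3.56"},
        {"Name": "Diapers", "Price": "$21.45"},
    ],
}


def select(table, key_col, out_col, key):
    """Return out_col of the unique row whose key_col equals key, else None."""
    matches = [row[out_col] for row in TABLES[table] if row[key_col] == key]
    return matches[0] if len(matches) == 1 else None


def single_lookups():
    """Enumerate every (table, key column, output column) lookup step."""
    for table, rows in TABLES.items():
        cols = list(rows[0])
        for key_col, out_col in product(cols, cols):
            if key_col != out_col:
                yield (table, key_col, out_col)


def run(program, value):
    """Apply a chain of lookup steps to an input string."""
    for table, key_col, out_col in program:
        if value is None:
            return None
        value = select(table, key_col, out_col, value)
    return value


def synthesize(examples, max_depth=2):
    """Return all lookup chains (up to max_depth) consistent with the examples."""
    steps = list(single_lookups())
    consistent = []
    for depth in range(1, max_depth + 1):
        for program in product(steps, repeat=depth):
            if all(run(program, inp) == out for inp, out in examples):
                consistent.append(program)
    return consistent


# Two input-output examples: item id -> price.
programs = synthesize([("S30", "$145.67"), ("B56", "$3.56")])
print(programs)
print(run(programs[0], "D32"))
```

In this toy setting, the only consistent program joins the two tables: it looks up the item name by id and then the price by name, and it generalizes to the unseen input `D32`. Exhaustive enumeration is used here only for brevity; it grows exponentially with chain length and would not scale to realistic tables.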
