Learning constraints in spreadsheets and tabular data

Spreadsheets, comma-separated value (CSV) files, and other tabular data representations are in wide use today. However, writing, maintaining, and identifying good formulas for tabular data and spreadsheets can be time-consuming and error-prone. We investigate the automatic, unsupervised learning of constraints (formulas and relations) in raw tabular data. We represent common spreadsheet formulas and relations through predicates and expressions whose arguments must satisfy the inherent properties of the constraint. The challenge is to automatically infer the set of constraints present in the data, without labeled examples or user feedback. We propose a two-stage generate-and-test method in which the first stage uses constraint solving techniques to efficiently reduce the number of candidates, based on the predicate signatures. Our approach takes inspiration from inductive logic programming, constraint learning, and constraint satisfaction. We show that we are able to accurately discover constraints in spreadsheets from various sources.
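
The two-stage generate-and-test idea can be illustrated with a minimal sketch. It is not the method of this work: the first stage here is a plain type filter rather than a constraint solver, only a single hypothetical predicate (one column equals the row-wise sum of two others) is considered, and all function names and the example table are assumptions made for illustration.

```python
from itertools import combinations


def is_numeric(column):
    """Signature check: every cell is an int or float (bools excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
    )


def candidate_sums(table):
    """Stage 1 (generate): enumerate only assignments whose columns satisfy
    the signature of the hypothetical predicate result = a + b.
    Assumption: a simple type filter stands in for constraint solving."""
    numeric = [name for name, col in table.items() if is_numeric(col)]
    for result in numeric:
        others = [n for n in numeric if n != result]
        for a, b in combinations(others, 2):
            yield result, a, b


def holds(table, result, a, b, tol=1e-9):
    """Stage 2 (test): check the candidate against every row of the data."""
    rows = zip(table[result], table[a], table[b])
    return all(abs(r - (x + y)) <= tol for r, x, y in rows)


def learn_sum_constraints(table):
    """Return every (result, a, b) triple for which result = a + b holds."""
    return [c for c in candidate_sums(table) if holds(table, *c)]


# Illustrative table (made-up data): column name -> list of cell values.
table = {
    "item": ["pens", "ink", "paper"],
    "q1": [10, 4, 7],
    "q2": [5, 6, 1],
    "total": [15, 10, 8],
}

print(learn_sum_constraints(table))
```

In this sketch the text column `item` is discarded before any test runs, which is the role the signature-based first stage plays: it shrinks the candidate set so that the more expensive row-by-row test is applied to fewer assignments.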
