Auto-transform

Data transformation is a long-standing problem in data management. Recent work adopts a "transform-by-example" (TBE) paradigm, which infers transformation programs from user-provided input/output examples. This greatly improves usability and has brought such features into mainstream software like Microsoft Excel, Power BI, and Trifacta. While TBE is great progress, the need for users to provide paired input/output examples still limits its applicability. In this work, we study an alternative that transforms data based on input/output data patterns only (without paired examples). We term this new paradigm transform-by-patterns (TBP). Specifically, we demonstrate that there is a rich class of transformations in TBP that can be "learned" from large collections of paired table columns. We show that the proposed method can harvest such transformations across diverse domains and corpora (e.g., in different languages such as English, Chinese, and Spanish). The TBP transformations obtained this way can be used in scenarios such as suggesting data repairs in tables or automating transformations in ETL pipelines. Extensive experiments on real data suggest that TBP outperforms existing methods on tasks such as data repair, and is a promising direction for future research.
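To make the idea of transforming by patterns concrete, here is a minimal Python sketch. It is an illustration only, not the method of the paper: the pattern notation, the helper names (`to_pattern`, `iso_to_us`, `transform_column`), and the single hand-written rule are all assumptions, whereas the paper learns such rules from large collections of paired table columns.

```python
from itertools import groupby


def to_pattern(value: str) -> str:
    """Generalize a string into a coarse pattern (hypothetical notation).

    Runs of digits become <digit>{n}, runs of letters become <letter>{n},
    and every other character is kept as a literal.
    """
    def token_class(ch: str):
        if ch.isdigit():
            return ("<digit>", "")
        if ch.isalpha():
            return ("<letter>", "")
        return ("literal", ch)

    parts = []
    for (kind, literal), run in groupby(value, key=token_class):
        length = len(list(run))
        if kind == "literal":
            parts.append(literal * length)
        else:
            parts.append(f"{kind}{{{length}}}")
    return "".join(parts)


def iso_to_us(value: str) -> str:
    """Hypothetical transformation program: YYYY-MM-DD -> MM/DD/YYYY."""
    year, month, day = value.split("-")
    return f"{month}/{day}/{year}"


# A single hand-written rule, keyed by source pattern (assumed for illustration).
RULES = {"<digit>{4}-<digit>{2}-<digit>{2}": iso_to_us}


def transform_column(values):
    """Apply a rule to each value whose pattern matches; leave others as-is."""
    result = []
    for value in values:
        program = RULES.get(to_pattern(value))
        result.append(program(value) if program else value)
    return result


print(to_pattern("2020-01-15"))
print(transform_column(["2020-01-15", "01/15/2020", "n/a"]))
```

Note that no paired input/output examples are supplied at transformation time: the rule fires purely because a value matches the source pattern, which is the distinction from TBE that the abstract draws.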
