Efficient Detection of Soft Concatenation Mapping

In modern big data warehouse systems, the values of one column can often be derived from one or more other columns by transforming and concatenating them. We call this relationship between columns a Soft Concatenation Mapping (SCM). SCMs imply significant redundancy in the schema or data and can therefore be exploited for data integration or data compression. In this paper, we formalize the problem of SCM detection and prove that it is NP-hard. We then propose efficient approximate algorithms that detect either all SCMs or an optimal set of SCMs in a table. Our experiments on both real-world and synthetic datasets show promising results.
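To make the notion concrete, the following minimal sketch checks how often one candidate mapping holds in a small table. It is an illustration only, not the paper's detection algorithm: the function name `scm_support`, the sample columns (`date`, `region`, `key`), the transformations, and the reading of "soft" as "holds for a large fraction of rows" are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, str]
Transform = Callable[[str], str]


def scm_support(
    rows: List[Row],
    target: str,
    sources: Sequence[str],
    transforms: Sequence[Transform],
) -> float:
    """Return the fraction of rows whose target value equals the
    concatenation of the transformed source values (hypothetical measure)."""
    if not rows:
        return 0.0
    hits = 0
    for row in rows:
        # Apply each transformation to its source column, then concatenate.
        derived = "".join(t(row[c]) for c, t in zip(sources, transforms))
        if derived == row[target]:
            hits += 1
    return hits / len(rows)


def year_month(date: str) -> str:
    """Assumed transformation: '2017-03-05' -> '201703'."""
    return date.replace("-", "")[:6]


def identity(value: str) -> str:
    """Assumed transformation: leave the value unchanged."""
    return value


# Hypothetical table: 'key' is mostly year-month of 'date' followed by 'region'.
rows = [
    {"date": "2017-03-05", "region": "EU", "key": "201703EU"},
    {"date": "2018-11-20", "region": "US", "key": "201811US"},
    {"date": "2019-01-02", "region": "CN", "key": "201901CN"},
    {"date": "2019-07-15", "region": "EU", "key": "UNKNOWN"},  # exception row
]

support = scm_support(rows, "key", ["date", "region"], [year_month, identity])
print(support)
```

In this sample the mapping holds for three of the four rows. Checking a single candidate is cheap; the difficulty lies in searching over all column combinations and transformations, which is the problem the abstract describes as NP-hard.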
