EISA: An Efficient Information Theoretical Approach to Value Segmentation in Large Databases

Value disparity is a widely known problem that contributes to poor data quality and raises many issues in data integration tasks. Value disparity, also known as column heterogeneity, occurs when the same entity is represented by disparate values, often within the same column of a database table. A first step in overcoming value disparity is to identify the distinct segments. This is a highly challenging task due to the high number of features that define a particular segment, as well as the need to undertake value comparisons, whose number can grow exponentially in large databases. In this paper, we propose an efficient information theoretical approach to value segmentation, namely EISA. EISA not only reduces the number of relevant features but also compresses the size of the values to be segmented. We have applied our method to three datasets of varying sizes. Our experimental evaluation of the method demonstrates a high level of accuracy with reasonable efficiency.
