Type-based categorization of relational attributes

This work addresses the categorization of relational attributes by data type. Assuming that an attribute's type and characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying attributes according to the semantic type of the data they contain (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used in other applications as well, such as clustering and index optimization or compression. Such classification is useful when very large data collections, generated in a distributed, ungoverned fashion, end up with unknown, incomplete, inconsistent, or very complex schemata and schema-level metadata. We concentrate on heuristically generating type-based attribute signatures using both local and global computation approaches. We show experimentally that decomposing data into q-grams and building signatures from the q-gram distributions achieves very good classification accuracy, provided that a large sample of the data is available for building the signatures. We then turn to cases where only a very small sample of the data is available, so that accurately capturing the q-gram distribution of a given data type is almost impossible. For these cases, we propose techniques based on dimensionality reduction and soft clustering that exploit correlations between attributes to improve classification accuracy.
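To make the idea of a q-gram distribution signature concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes character q-grams with q = 3, hypothetical padding characters `#` and `$`, and cosine similarity as the comparison measure; the abstract does not specify any of these choices. The function names (`qgrams`, `signature`, `cosine`, `classify`) and the toy data are illustrative only.

```python
from collections import Counter
import math


def qgrams(value, q=3):
    """Return the overlapping character q-grams of a string.

    The value is padded with '#' and '$' (an assumption of this sketch)
    so that short values still produce q-grams.
    """
    padded = "#" * (q - 1) + value + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def signature(values, q=3):
    """Build a normalized q-gram frequency distribution for an attribute."""
    counts = Counter()
    for v in values:
        counts.update(qgrams(v, q))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse q-gram distributions."""
    dot = sum(w * sig_b.get(gram, 0.0) for gram, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(values, labeled_signatures, q=3):
    """Assign the label whose signature is most similar to the sample's."""
    sig = signature(values, q)
    return max(labeled_signatures,
               key=lambda label: cosine(sig, labeled_signatures[label]))


# Toy labeled attributes (made-up values, for illustration only).
labeled = {
    "email": signature(["alice@example.com", "bob@mail.org", "carol@example.net"]),
    "ip_address": signature(["192.168.0.1", "10.0.0.254", "172.16.5.12"]),
}

# An attribute of unknown type, represented by a sample of its values.
unknown_column = ["dave@example.org", "eve@mail.com"]
predicted = classify(unknown_column, labeled)
print(predicted)
```

With samples this small the estimated distributions are sparse and noisy, which is the situation the dimensionality reduction and soft-clustering techniques mentioned above are meant to address.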
