TYPifier: Inferring the type semantics of structured data

Structured data representing entity descriptions often lacks precise type information: either the type to which an entity belongs is unknown, or the known type is too general to be useful. In this work, we address this novel problem of inferring the type semantics of structured data, which we call typification. We formulate it as a clustering problem and discuss the features needed to obtain several solutions based on existing clustering algorithms. Because schema features perform best but are not abundantly available, we propose an approach to derive them automatically from data. We then present TYPifier, a novel clustering algorithm optimized for the use of schema features that, in experiments, yields better typification results than the baseline clustering solutions.
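To make the clustering formulation concrete, the following is a minimal sketch of grouping entities by schema features. It is not the TYPifier algorithm: it assumes that the schema feature of an entity is simply its set of attribute names, and it swaps in a greedy, threshold-based grouping with Jaccard similarity. The function names, the threshold value, and the toy entities are all hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two attribute-name sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_schema(entities: Dict[str, Set[str]],
                      threshold: float = 0.5) -> List[Set[str]]:
    """Greedy one-pass grouping (illustrative only, not TYPifier).

    Each entity joins the first cluster whose accumulated attribute set
    is at least `threshold` similar to its own; otherwise it starts a
    new cluster. Entities are visited in sorted order for determinism.
    """
    clusters: List[dict] = []
    for name in sorted(entities):
        attrs = entities[name]
        for cluster in clusters:
            if jaccard(attrs, cluster["attrs"]) >= threshold:
                cluster["members"].add(name)
                cluster["attrs"] |= attrs  # grow the cluster's schema
                break
        else:
            # Copy the set so the caller's data is never mutated.
            clusters.append({"attrs": set(attrs), "members": {name}})
    return [cluster["members"] for cluster in clusters]


# Hypothetical entity descriptions: identifier -> attribute names.
entities = {
    "e1": {"name", "birthDate", "birthPlace"},
    "e2": {"name", "birthDate", "occupation"},
    "e3": {"name", "population", "country"},
    "e4": {"name", "population", "area"},
}
clusters = cluster_by_schema(entities)
```

On this toy input, the two person-like entities and the two place-like entities end up in separate groups, each of which can be read as an inferred type. A real solution would need a more robust clustering strategy and richer schema features than raw attribute names.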
