Automatic structured Web databases classification

The number of structured Web databases keeps growing, which poses enormous challenges for large-scale Deep Web data integration. Organizing these databases into a hierarchical directory tree is a critical step toward such integration. This paper presents a method for automatically classifying Web databases. First, the semantic similarity between Web databases is computed from their interface schemas, a problem that is reduced to extended optimal matching in a bipartite graph. Then, based on the resulting similarity matrix, an agglomerative hierarchical clustering algorithm organizes the Web databases into a hierarchy tree automatically. Theoretical analysis and experimental results show that the method is efficient.
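
The two stages described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the "extended" optimal matching or the linkage criterion, so the sketch assumes a plain maximum-weight one-to-one matching (found by brute force, which only suits small schemas), a token-overlap (Jaccard) attribute similarity in place of a true semantic measure, and average linkage for clustering. The names `attr_sim`, `schema_sim`, `cluster`, and the sample schemas are hypothetical.

```python
from itertools import permutations


def attr_sim(a, b):
    """Jaccard overlap of word tokens (assumed stand-in for a semantic measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def schema_sim(s1, s2):
    """Weight of the best one-to-one attribute matching between two schemas,
    normalized by the size of the larger schema (assumed normalization)."""
    small, large = sorted((s1, s2), key=len)
    if not large:
        return 0.0
    # Brute-force search over all one-to-one assignments of `small` into `large`.
    best = max(
        sum(attr_sim(a, b) for a, b in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)


def cluster(schemas):
    """Average-link agglomerative clustering; returns a nested-tuple tree of names."""
    names = list(schemas)
    # Pairwise similarity matrix, stored as a dict keyed by name pairs.
    sim = {(a, b): schema_sim(schemas[a], schemas[b]) for a in names for b in names}
    # Each cluster is a pair: (subtree, list of member names).
    clusters = [(n, [n]) for n in names]

    def link(c1, c2):
        total = sum(sim[a, b] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Merge the two clusters with the highest average similarity.
        i, j = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical query-interface schemas, given as lists of attribute labels.
schemas = {
    "books_a": ["title", "author", "isbn"],
    "books_b": ["book title", "author name", "publisher"],
    "flights_a": ["departure city", "arrival city", "departure date"],
    "flights_b": ["from city", "to city", "date"],
}
tree = cluster(schemas)
print(tree)
```

On this toy input the two book interfaces and the two flight interfaces each merge first, and the two resulting groups are joined at the root. For realistic schema sizes the brute-force matching would need to be replaced by a polynomial-time assignment algorithm such as the Hungarian method.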
