Integrating Web objects extracted from multiple sites into relational database
This paper studies the problem of integrating heterogeneous, semi-structured Web objects into a relational database. A generalized sequential learning model, named Combined Conditional Random Fields, is presented for schema matching between pairs of heterogeneous Web data sources. The proposed model learns from both manually labeled training data and unlabeled database records, thereby reducing its dependence on tediously labeled samples. It also provides a novel way to incorporate the two-dimensional neighborhood dependencies between Web data elements. Moreover, a constrained Viterbi algorithm is implemented to perform label inference under the imposed label constraints, yielding the optimal data integration. Experimental results on a large number of Web pages from diverse domains show that the proposed method significantly improves matching accuracy.
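The abstract does not spell out the constrained Viterbi algorithm, so the following is only a minimal sketch of the general technique, assuming a plain linear-chain model with log-scores. The function name `constrained_viterbi`, the `emission` and `transition` score tables, and the `constraints` mapping (position to forced label) are all hypothetical names chosen for illustration; they are not taken from the paper, and the paper's two-dimensional dependencies are not modeled here.

```python
NEG_INF = float("-inf")


def constrained_viterbi(emission, transition, constraints=None):
    """Best label path for a linear chain, with some positions forced.

    emission[t][y]    : log-score of label y at position t (assumed input)
    transition[p][y]  : log-score of moving from label p to label y
    constraints       : optional dict {position: required label index}
    Returns (path, score).
    """
    if not emission:
        return [], 0
    constraints = constraints or {}
    n_steps = len(emission)
    n_labels = len(transition)

    def allowed(t, y):
        # A label is allowed unless position t is pinned to a different one.
        return t not in constraints or constraints[t] == y

    score = [[NEG_INF] * n_labels for _ in range(n_steps)]
    back = [[0] * n_labels for _ in range(n_steps)]

    for y in range(n_labels):
        if allowed(0, y):
            score[0][y] = emission[0][y]

    for t in range(1, n_steps):
        for y in range(n_labels):
            if not allowed(t, y):
                continue  # disallowed labels keep a score of -inf
            best_prev = max(
                range(n_labels),
                key=lambda p: score[t - 1][p] + transition[p][y],
            )
            score[t][y] = (
                score[t - 1][best_prev] + transition[best_prev][y] + emission[t][y]
            )
            back[t][y] = best_prev

    # Trace back from the best final label.
    last = max(range(n_labels), key=lambda y: score[-1][y])
    path = [last]
    for t in range(n_steps - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[-1][last]
```

In this sketch a constraint simply removes every competing label at the pinned position, so the dynamic program still returns the highest-scoring path among those consistent with the imposed labels. For example, pinning the middle element of a three-element sequence to label 1 forces the decoder to pay the transition cost into and out of that label if the neighbors prefer label 0.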