Clustering-Based Schema Matching of Web Data for Constructing Digital Library

The abundance of information on the web has attracted much research on reusing valuable web data in other information applications, such as digital libraries. Because web information is published by various contributors in different ways, schema matching is a basic problem in integrating heterogeneous data sources. Web information integration raises new challenges in two respects: web data lack complete schema definitions, and schema matching between web data cannot be reduced to a 1-1 mapping problem. In this paper we propose an algorithm, COSM, to automate the schema matching process for web data. The matching process is transformed into a clustering problem: data elements grouped into the same cluster are regarded as matching elements. COSM is mainly an instance-level matching approach, combined with a partial name matcher when calculating the distance metrics between elements. A data pretreatment step is carried out before clustering so that the distance metrics between elements are reasonable. Experiments testing the algorithm, together with its application in constructing a Chinese folk music digital library, demonstrate the algorithm's efficiency.
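
The idea of treating schema matching as clustering can be illustrated with a minimal Python sketch. It is not the COSM algorithm itself: the abstract does not specify the distance metrics, the pretreatment, or the clustering method, so everything below is an assumption made for illustration. The sketch uses Jaccard distance over instance tokens as the instance-level measure, `difflib.SequenceMatcher` as a stand-in name matcher that is skipped when a name is missing, and average-linkage agglomerative clustering with a cut-off. The names `element_distance`, `cluster_elements`, `name_weight`, and `threshold` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def tokens(values):
    """Lower-cased word tokens drawn from all instance values of an element."""
    return {tok for v in values for tok in v.lower().split()}


def instance_distance(a_values, b_values):
    """Jaccard distance between the token sets of two elements (assumed metric)."""
    ta, tb = tokens(a_values), tokens(b_values)
    if not ta and not tb:
        return 1.0  # no instance evidence: treat as maximally distant
    return 1.0 - len(ta & tb) / len(ta | tb)


def name_distance(a_name, b_name):
    """String dissimilarity between two element names (assumed name matcher)."""
    return 1.0 - SequenceMatcher(None, a_name.lower(), b_name.lower()).ratio()


def element_distance(a, b, name_weight=0.3):
    """Combine instance and name distances; an element is (name, values).

    The name matcher is partial: if either name is missing, only the
    instance-level distance is used. The 0.3 weight is an arbitrary assumption.
    """
    (name_a, vals_a), (name_b, vals_b) = a, b
    d_inst = instance_distance(vals_a, vals_b)
    if not name_a or not name_b:
        return d_inst
    return (1 - name_weight) * d_inst + name_weight * name_distance(name_a, name_b)


def cluster_elements(elements, threshold=0.6):
    """Average-linkage agglomerative clustering, stopped at `threshold`.

    Returns clusters as lists of element indices. Elements in one cluster are
    read as matching each other, so a cluster may hold more than two elements
    and the result is not restricted to 1-1 mappings.
    """
    clusters = [[i] for i in range(len(elements))]

    def linkage(c1, c2):
        total = sum(element_distance(elements[i], elements[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (p < q always holds here).
        (p, q), dist = min(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Hypothetical elements from two web sources; the last one has no label.
    sample = [
        ("title", ["Jasmine Flower", "Butterfly Lovers"]),
        ("song title", ["Jasmine Flower", "Butterfly Lovers", "Fishing Song"]),
        ("region", ["Jiangsu", "Shanxi"]),
        ("", ["Jiangsu", "Shanxi", "Zhejiang"]),
    ]
    for cluster in cluster_elements(sample):
        print([sample[i][0] or "<unnamed>" for i in cluster])
```

Because the unnamed element is compared on instances alone, it can still join the cluster of a labelled element, which is the situation the abstract describes when web data lack a complete schema definition.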
