Learning to Match the Schemas of Data Sources: A Multistrategy Approach

The problem of integrating data from multiple data sources—either on the Internet or within enterprises—has received much attention in the database and AI communities. The focus has been on building data integration systems that provide a uniform query interface to the sources. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the query interface and the source schemas. Examples of mappings are “element location maps to address” and “price maps to listed-price”. We propose a multistrategy learning approach to automatically find such mappings. The approach applies multiple learner modules, where each module exploits a different type of information either in the schemas of the sources or in their data, then combines the predictions of the modules using a meta-learner. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. We describe the LSD system, which employs this approach to find semantic mappings. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data. We test LSD experimentally on several real-world domains. The experiments validate the utility of multistrategy learning for data integration and show that LSD proposes semantic mappings with a high degree of accuracy.

[1]  Larry A. Rendell,et al.  Constructive Induction Using Fragmentary Knowledge , 1996, ICML.

[2]  Erhard Rahm,et al.  On Matching Schemas Automatically , 2001 .

[3]  Mark A. Musen,et al.  PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment , 2000, AAAI/IAAI.

[4]  Erhard Rahm,et al.  Comparison of Schema Matching Evaluations , 2002, Web, Web-Services, and Database Systems.

[5]  Marc Friedman,et al.  Efficiently Executing Information-Gathering Plans , 1997, IJCAI.

[6]  Pedro M. Domingos,et al.  Reconciling schemas of disparate data sources: a machine-learning approach , 2001, SIGMOD '01.

[7]  Subbarao Kambhampati,et al.  Optimizing Recursive Information-Gathering Plans , 1999, IJCAI.

[8]  Dayne Freitag,et al.  Machine Learning for Information Extraction in Informal Domains , 2000, Machine Learning.

[9]  Luigi Palopoli,et al.  Semi-automatic, semantic discovery of properties from database schemes , 1998, Proceedings. IDEAS'98. International Database Engineering and Applications Symposium (Cat. No.98EX156).

[10]  Pedro M. Domingos,et al.  On the Optimality of the Simple Bayesian Classifier under Zero-One Loss , 1997, Machine Learning.

[11]  Hans Chalupsky,et al.  OntoMorph: A Translation System for Symbolic Knowledge , 2000, KR.

[12]  Erhard Rahm,et al.  Similarity flooding: a versatile graph matching algorithm and its application to schema matching , 2002, Proceedings 18th International Conference on Data Engineering.

[13]  Craig A. Knoblock,et al.  Modeling Web Sources for Information Integration , 1998, AAAI/IAAI.

[14]  William W. Cohen,et al.  Joins that Generalize: Text Classification Using WHIRL , 1998, KDD.

[15]  Nicholas Kushmerick,et al.  Wrapper verification , 2000, World Wide Web.

[16]  Georg Groh,et al.  Facilitating the Exchange of Explicit Knowledge through Ontology Mappings , 2001, FLAIRS.

[17]  Mark A. Musen,et al.  Anchor-PROMPT: Using Non-Local Context for Semantic Matching , 2001, OIS@IJCAI.

[18]  Pedro M. Domingos,et al.  Learning to map between ontologies on the semantic web , 2002, WWW '02.

[19]  Tova Milo,et al.  Using Schema Matching to Simplify Heterogeneous Data Translation , 1998, VLDB.

[20]  DoanAnHai,et al.  Learning to match ontologies on the Semantic Web , 2003, VLDB 2003.

[21]  Stephen Muggleton,et al.  Learning to Relate Terms in a Multiple Agent Environment , 1991, EWSL.

[22]  Joann J. Ordille,et al.  Querying Heterogeneous Information Sources Using Source Descriptions , 1996, VLDB.

[23]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[24]  Silvana Castano,et al.  A schema analysis and reconciliation tool environment for heterogeneous databases , 1999, Proceedings. IDEAS'99. International Database Engineering and Applications Symposium (Cat. No.PR00265).

[25]  Hector Garcia-Molina,et al.  Template-based wrappers in the TSIMMIS system , 1997, SIGMOD '97.

[26]  Jennifer Widom,et al.  The TSIMMIS Approach to Mediation: Data Models and Languages , 1997, Journal of Intelligent Information Systems.

[27]  Dan Roth,et al.  The Use of Classifiers in Sequential Inference , 2001, NIPS.

[28]  Alon Y. Halevy,et al.  An adaptive query execution system for data integration , 1999, SIGMOD '99.

[29]  C. M. Sperberg-McQueen,et al.  Extensible Markup Language (XML) , 1997, World Wide Web J..

[30]  Laura M. Haas,et al.  Schema Mapping as Query Discovery , 2000, VLDB.

[31]  Pedro M. Domingos,et al.  Representing and reasoning about mappings between domain models , 2002, AAAI/IAAI.

[32]  Ryutaro Ichise,et al.  Rule Induction for Concept Hierarchy Alignment , 2001, Workshop on Ontology Learning.

[33]  Ian H. Witten,et al.  Issues in Stacked Generalization , 2011, J. Artif. Intell. Res..

[34]  Nils J. Nilsson,et al.  A Formal Basis for the Heuristic Determination of Minimum Cost Paths , 1968, IEEE Trans. Syst. Sci. Cybern..

[35]  Deborah L. McGuinness,et al.  The Chimaera Ontology Environment , 2000, AAAI/IAAI.

[36]  Craig A. Knoblock,et al.  Wrapper generation for semi-structured Internet sources , 1997, SGMD.

[37]  Karl Weinmeister,et al.  PROVERB: The Probabilistic Cruciverbalist , 1999, AAAI/IAAI.

[38]  Jennifer Widom,et al.  The TSIMMIS Project: Integration of Heterogeneous Information Sources , 1994, IPSJ.

[39]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.

[40]  Prasenjit Mitra,et al.  Semi-automatic Integration of Knowledge Sources , 1999 .

[41]  Nicholas Kushmerick,et al.  Wrapper induction: Efficiency and expressiveness , 2000, Artif. Intell..

[42]  Chris Clifton,et al.  Experience with a Combined Approach to Attribute-Matching Across Heterogeneous Databases , 1997, DS-7.

[43]  Oren Etzioni,et al.  Category Translation: Learning to Understand Information on the Internet , 1995, IJCAI.

[44]  Chris Clifton,et al.  SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks , 2000, Data Knowl. Eng..

[45]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.