A Collective, Probabilistic Approach to Schema Mapping

We propose a probabilistic approach to the problem of schema mapping. Our approach is declarative, scalable, and extensible. It builds upon recent results in both schema mapping and probabilistic reasoning and contributes novel techniques in both fields. We introduce the problem of mapping selection, that is, choosing the best mapping from a space of potential mappings, given both metadata constraints and a data example. Because selection has to reason holistically about the inputs and the dependencies between the chosen mappings, we define a new schema mapping optimization problem that captures interactions between mappings. We then introduce Collective Mapping Discovery (CMD), our solution to this problem, which uses state-of-the-art probabilistic reasoning techniques and tolerates inconsistencies and incompleteness. Using hundreds of realistic integration scenarios, we demonstrate that the accuracy of CMD is more than 33% above that of metadata-only approaches even for small data examples, and that CMD routinely finds perfect mappings even if a quarter of the data is inconsistent.
