Deep SMAnE: Deep Similarity Matrix Adjustment and Evaluation to Improve Schema Matching

Schema matching underlies the integration of structured and semistructured data and serves as a tool in many contemporary business and commerce applications. Although it has been investigated for many years in the fields of databases, AI, the Semantic Web, and data mining, the core challenge remains the ability to create quality matchers: automatic tools for identifying correspondences among data concepts (e.g., database attributes). In this work, we offer a novel post-processing step for schema matching that improves the final matching outcome without human intervention. We present a new mechanism, similarity matrix adjustment, to calibrate a matching result, and propose an algorithm (dubbed ADnEV) that uses deep neural networks to manipulate similarity matrices created by state-of-the-art matchers. ADnEV learns two models that iteratively adjust and evaluate the original similarity matrix. We show conditions for dominance and convergence of ADnEV and demonstrate empirically the effectiveness of the proposed algorithmic solution for improving matching results, using real-world benchmark ontology and schema sets.

ACM Reference Format: Roee Shraga, Avigdor Gal, and Haggai Roitman. 2019. Deep SMAnE Deep Similarity Matrix Adjustment and Evaluation to Improve Schema Matching. In Proceedings of ACM Conference (Conference’17). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
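To make the adjust-and-evaluate idea from the abstract concrete, the following is a minimal sketch of such a loop in plain Python. It is not the authors' implementation: `toy_adjust` and `toy_evaluate` are hypothetical hand-written stand-ins for the two learned neural models, and the stopping rule (stop as soon as the evaluation score no longer improves) is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]  # rows: attributes of one schema, columns: the other


def toy_adjust(matrix: Matrix) -> Matrix:
    """Hypothetical adjustor: push each entry away from its row mean, clipped to [0, 1]."""
    adjusted = []
    for row in matrix:
        mean = sum(row) / len(row)
        adjusted.append([min(1.0, max(0.0, v + 0.5 * (v - mean))) for v in row])
    return adjusted


def toy_evaluate(matrix: Matrix) -> float:
    """Hypothetical evaluator: average gap between each row's best entry and its row mean."""
    gaps = [max(row) - sum(row) / len(row) for row in matrix]
    return sum(gaps) / len(gaps)


def adjust_and_evaluate(
    matrix: Matrix,
    adjust: Callable[[Matrix], Matrix],
    evaluate: Callable[[Matrix], float],
    max_iters: int = 20,
) -> Tuple[Matrix, float]:
    """Repeatedly adjust the matrix, keeping a candidate only if its score improves."""
    best, best_score = matrix, evaluate(matrix)
    for _ in range(max_iters):
        candidate = adjust(best)
        score = evaluate(candidate)
        if score <= best_score:  # assumed stopping rule: no further improvement
            break
        best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    # A small similarity matrix as a first-line matcher might produce it.
    initial = [[0.6, 0.4], [0.3, 0.7]]
    final, final_score = adjust_and_evaluate(initial, toy_adjust, toy_evaluate)
    print(final, final_score)
```

Because a candidate is kept only when its score strictly improves, the returned matrix never scores worse than the input under the chosen evaluator, and the loop stops after at most `max_iters` rounds. This is a property of the sketch only; it is not a restatement of the dominance and convergence conditions shown in the paper.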
