Non-binary evaluation measures for big data integration

The evolution of data accumulation, management, analytics, and visualization has given rise to the term big data and poses new challenges for the task of data integration. This task, common to any matching problem in computer science, involves generating alignments between structured data in an automated fashion. Historically, set-based measures built on binary similarity matrices (match/non-match) have dominated the evaluation of matching tasks. In the presence of big data, however, such measures no longer suffice. In this work, we propose evaluation methods that also apply to non-binary matrices. We formally define non-binary evaluation and introduce several new non-binary measures based on a vector space representation of the matching outcome. We provide empirical analyses of the usefulness of non-binary evaluation and show its superiority over its binary counterpart in several problem domains.
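To make the contrast between binary and non-binary evaluation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the measures defined in this work: the function names, the 0.5 threshold, and the choice of an inner product normalized by total similarity mass are all hypothetical, chosen only because they reduce to set-based precision and recall when the input matrix is binary.

```python
import numpy as np


def binary_precision_recall(match, reference):
    """Set-based precision/recall over 0/1 match matrices."""
    match = np.asarray(match, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    true_pos = np.logical_and(match, reference).sum()
    precision = true_pos / match.sum() if match.sum() else 0.0
    recall = true_pos / reference.sum() if reference.sum() else 0.0
    return float(precision), float(recall)


def nonbinary_precision_recall(similarity, reference):
    """Hypothetical non-binary analogue (an assumption, not the paper's
    definition): flatten both matrices into vectors and normalize their
    inner product by the total mass of each vector."""
    s = np.asarray(similarity, dtype=float).ravel()
    r = np.asarray(reference, dtype=float).ravel()
    overlap = float(s @ r)
    precision = overlap / float(s.sum()) if s.sum() else 0.0
    recall = overlap / float(r.sum()) if r.sum() else 0.0
    return precision, recall


# Toy 2x2 example: rows and columns are attributes of two schemas.
reference = [[1, 0], [0, 1]]               # exact (binary) reference match
similarity = [[0.9, 0.2], [0.4, 0.5]]      # matcher's non-binary output
thresholded = (np.asarray(similarity) >= 0.5).astype(int)  # assumed cut-off

binary_scores = binary_precision_recall(thresholded, reference)
nonbinary_scores = nonbinary_precision_recall(similarity, reference)
```

In this toy case the thresholded matrix coincides with the reference, so the binary measures report a perfect result, while the non-binary variant (roughly 0.7 for both scores) still reflects the similarity mass the matcher assigned to non-matching pairs and the less-than-full confidence on the correct ones.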
