Dynamical order construction in data fusion

Fusion functions based on order relations are formalized. It is pointed out that an appropriate order relation is not always at hand. The DOC algorithm, which constructs an appropriate order relation dynamically, is provided. Selection strategies are discussed. A thorough experimental evaluation shows the benefits of the proposed techniques.

A crucial operation in the maintenance of data quality in relational databases is to remove tuples that describe the same entity (i.e., duplicate tuples) and to replace them with a single tuple that minimizes information loss. A function that combines multiple tuples into one is called a fusion function. In this paper, we investigate fusion functions for attributes whose values can be sorted by means of an order relation that reflects a notion of generality. It is shown that providing such an order relation a priori, let alone keeping it up to date, is a costly operation. Therefore, the Dynamical Order Construction (DOC) algorithm is proposed, which constructs an order relation in an automated fashion by inspecting the data that need to be fused. Such order relations can be deployed immediately in a framework of selectional fusion functions, which are fusion functions that adopt the sort-and-select principle. These fusion functions are investigated closely in terms of their selection strategies. An experimental evaluation of our method shows the influence of the parameters and the benefit with respect to using a fixed, predefined taxonomy.
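To make the sort-and-select principle concrete, the following is a minimal Python sketch of a selectional fusion function. It is not the DOC algorithm and not the paper's implementation: the order relation is written by hand as (more general, more specific) pairs and is assumed to be acyclic, and the names `transitive_closure` and `fuse`, the example values, and the frequency-based tie-break are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def transitive_closure(pairs):
    """Return the transitive closure of a relation given as (a, b) pairs."""
    closure = set(pairs)
    nodes = {x for pair in pairs for x in pair}
    # Warshall-style pass: the intermediate node k is the outermost loop.
    for k, i, j in product(nodes, repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure


def fuse(values, more_general):
    """Select one value from `values` (hypothetical selection strategy).

    `more_general` holds (general, specific) pairs and is assumed acyclic.
    Values that are more general than another observed value are discarded;
    ties are broken by frequency, then alphabetically.
    """
    closure = transitive_closure(more_general)
    distinct = set(values)
    candidates = [
        v for v in distinct
        if not any((v, w) in closure for w in distinct if w != v)
    ]
    counts = Counter(values)
    return min(candidates, key=lambda v: (-counts[v], v))


# Hand-written generality order, assumed for this example only.
order = {
    ("restaurant", "italian restaurant"),
    ("italian restaurant", "pizzeria"),
}
duplicates = ["restaurant", "pizzeria", "restaurant", "italian restaurant"]
fused = fuse(duplicates, order)
```

Here `fused` is the most specific observed value, because every other value in `duplicates` is more general than it under the closed relation. Values that the relation does not compare are all kept as candidates, which is where the choice of selection strategy starts to matter.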
