From Data Fusion to Knowledge Fusion

The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [20] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which focuses only on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show that data fusion approaches hold great promise for solving the knowledge fusion problem, and we suggest research directions based on a detailed error analysis of the methods.
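To make the data fusion task concrete, the following is a minimal sketch of one simple family of fusion methods: iterative, accuracy-weighted voting. It is an illustration only and is not the specific technique evaluated in the paper. The function name `fuse`, the parameters `rounds` and `init_accuracy`, the source names, and every observation in `claims` are hypothetical and chosen for the example.

```python
from collections import defaultdict

def fuse(claims, rounds=10, init_accuracy=0.8):
    """Accuracy-weighted voting over (source, item, value) claims.

    Illustrative sketch only; all names and defaults are assumptions.
    Returns (truths, accuracy): the chosen value per data item and the
    estimated accuracy per source.
    """
    sources = {s for s, _, _ in claims}
    accuracy = {s: init_accuracy for s in sources}
    truths = {}
    for _ in range(rounds):
        # Score each candidate value by the summed accuracy of its sources.
        scores = defaultdict(lambda: defaultdict(float))
        for s, item, value in claims:
            scores[item][value] += accuracy[s]
        # Pick the best-scoring value; ties break alphabetically.
        truths = {
            item: max(sorted(vals), key=vals.get)
            for item, vals in scores.items()
        }
        # Re-estimate each source's accuracy as the fraction of its
        # claims that agree with the currently chosen values.
        hits = defaultdict(int)
        total = defaultdict(int)
        for s, item, value in claims:
            total[s] += 1
            hits[s] += int(truths[item] == value)
        accuracy = {s: hits[s] / total[s] for s in sources}
    return truths, accuracy

# Hypothetical toy observations from three made-up sources.
claims = [
    ("site_a", "tom_cruise.date_of_birth", "1962-07-03"),
    ("site_b", "tom_cruise.date_of_birth", "1962-07-03"),
    ("site_c", "tom_cruise.date_of_birth", "1963-07-03"),
    ("site_a", "tom_cruise.place_of_birth", "Syracuse"),
    ("site_c", "tom_cruise.place_of_birth", "New York City"),
    ("site_a", "item_3", "p"),
    ("site_b", "item_3", "p"),
    ("site_c", "item_3", "q"),
]

truths, accuracy = fuse(claims)
print(truths)
print(accuracy)
```

On this toy input, the place of birth starts as a one-to-one tie. It is resolved in favor of site_a once the accuracy estimates reflect that site_c disagrees with the other sources on the remaining items. In the knowledge fusion setting described above, the same scheme could in principle treat each (extractor, Web page) pair as a source of triples, although extraction errors then mix with factual errors in the pages themselves.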

[1]  Divesh Srivastava,et al.  Integrating Conflicting Data: The Role of Source Dependence , 2009, Proc. VLDB Endow..

[2]  Lorenzo Blanco,et al.  Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources , 2010, CAiSE.

[3]  Xiaoxin Yin,et al.  Semi-supervised truth discovery , 2011, WWW.

[4]  Dan Roth,et al.  Latent credibility analysis , 2013, WWW.

[5]  Fabian M. Suchanek,et al.  AMIE: association rule mining under incomplete evidence in ontological knowledge bases , 2013, WWW.

[6]  Divesh Srivastava,et al.  Truth Finding on the Deep Web: Is the Problem Solved? , 2012, Proc. VLDB Endow..

[7]  Divesh Srivastava,et al.  Less is More: Selecting Sources Wisely for Integration , 2012, Proc. VLDB Endow..

[8]  Charu C. Aggarwal,et al.  Mining collective intelligence in diverse groups , 2013, WWW.

[9]  Daisy Zhe Wang,et al.  WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[11]  Felix Naumann,et al.  Data Fusion – Resolving Data Conflicts for Integration , 2009 .

[12]  Ashwin Machanavajjhala,et al.  An Analysis of Structured Data on the Web , 2012, Proc. VLDB Endow..

[13]  Serge Abiteboul,et al.  Corroborating information from disagreeing views , 2010, WSDM '10.

[14]  Erhard Rahm,et al.  Schema Matching and Mapping , 2013, Schema Matching and Mapping.

[15]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[16]  Bo Zhao,et al.  A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration , 2012, Proc. VLDB Endow..

[17]  Divesh Srivastava,et al.  Fusing data with correlations , 2014, SIGMOD Conference.

[18]  Divesh Srivastava,et al.  Global detection of complex copying relationships between sources , 2010, Proc. VLDB Endow..

[19]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[20]  Divesh Srivastava,et al.  Truth Discovery and Copying Detection in a Dynamic World , 2009, Proc. VLDB Endow..

[21]  Rahul Gupta,et al.  Biperpedia: An Ontology for Search Applications , 2014, Proc. VLDB Endow..

[22]  Gerhard Weikum,et al.  YAGO: A Core of Semantic Knowledge , 2007, WWW.

[23]  Oren Etzioni,et al.  Modeling Missing Data in Distant Supervision for Information Extraction , 2013, TACL.

[24]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[25]  Andrew McCallum,et al.  Assessing confidence of knowledge base content with an experimental study in entity resolution , 2013, AKBC '13.

[26]  B. Everitt,et al.  Statistical methods for rates and proportions , 1973 .

[27]  Jayant Madhavan,et al.  Google's Deep Web crawl , 2008, Proc. VLDB Endow..

[28]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[29]  Philip S. Yu,et al.  Truth Discovery with Multiple Conflicting Information Providers on the Web , 2007, IEEE Transactions on Knowledge and Data Engineering.

[30]  Beng Chin Ooi,et al.  Online data fusion , 2011, Proc. VLDB Endow..

[31]  Dan Roth,et al.  Knowing What to Believe (when you already know something) , 2010, COLING.

[32]  Felix Naumann,et al.  Data fusion , 2009, CSUR.

[33]  Doug Downey,et al.  Local and Global Algorithms for Disambiguation to Wikipedia , 2011, ACL.

[34]  Daniel Jurafsky,et al.  Distant supervision for relation extraction without labeled data , 2009, ACL.

[35]  Jiawei Han,et al.  A Probabilistic Model for Estimating Real-valued Truth from Conflicting Sources , 2012 .

[36]  Christopher Ré,et al.  Elementary: Large-Scale Knowledge-Base Construction via Machine Learning and Statistical Inference , 2012, Int. J. Semantic Web Inf. Syst..

[37]  Dan Roth,et al.  Making Better Informed Trust Decisions with Generalized Fact-Finding , 2011, IJCAI.