SLiMFast: Guaranteed Results for Data Fusion and Source Reliability

We focus on data fusion, i.e., the problem of unifying conflicting data from multiple data sources into a single representation by estimating the accuracy of each source. We propose SLiMFast, a framework that expresses data fusion as a statistical learning problem over discriminative probabilistic models, which in many cases correspond to logistic regression. In contrast to previous approaches that use complex generative models, discriminative models make fewer distributional assumptions about data sources and allow us to obtain rigorous theoretical guarantees. Furthermore, we show how SLiMFast enables incorporating domain knowledge into data fusion, yielding accuracy improvements of up to 50% over state-of-the-art baselines. Building upon our theoretical results, we design an optimizer that obviates the need for users to manually select an algorithm for learning SLiMFast's parameters. We validate our optimizer on multiple real-world datasets and show that it can accurately predict the learning algorithm that yields the best data fusion results.
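
To make the logistic-regression view of data fusion concrete, the following is a minimal sketch under stated assumptions, not the SLiMFast implementation. It assumes one learnable weight per source, a small set of objects with known true values, and a softmax over the values claimed for each object. The function names (`posterior`, `fit`, `fuse`), the toy data, the L2 penalty, and the reading of each weight through a sigmoid as a rough accuracy estimate are all hypothetical choices made for illustration. Domain-specific source features are omitted.

```python
import math

def posterior(claims, weights):
    """Softmax over candidate values; a value's score is the summed weight of its sources.

    claims: dict mapping value -> list of sources claiming that value for one object.
    """
    scores = {v: sum(weights[s] for s in srcs) for v, srcs in claims.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {v: math.exp(sc - top) for v, sc in scores.items()}
    total = sum(exps.values())
    return {v: e / total for v, e in exps.items()}

def fit(observations, labels, sources, lr=0.5, epochs=200, l2=0.01):
    """Gradient ascent on the L2-penalized conditional log-likelihood of labeled objects.

    Assumes every labeled true value is claimed by at least one source.
    """
    weights = {s: 0.0 for s in sources}
    for _ in range(epochs):
        grad = {s: 0.0 for s in sources}
        for obj, truth in labels.items():
            post = posterior(observations[obj], weights)
            for value, srcs in observations[obj].items():
                hit = 1.0 if value == truth else 0.0
                for s in srcs:
                    # d log P(truth) / d w_s = 1[s claims truth] - P(value claimed by s)
                    grad[s] += hit - post[value]
        for s in sources:
            weights[s] += lr * (grad[s] / len(labels) - l2 * weights[s])
    return weights

def fuse(claims, weights):
    """Return the most probable value for one object under the learned weights."""
    post = posterior(claims, weights)
    return max(post, key=post.get)

# Toy data (hypothetical): object -> {claimed value: [sources claiming it]}
observations = {
    "o1": {"x": ["a", "b"], "y": ["c"]},
    "o2": {"x": ["a"], "y": ["b", "c"]},
    "o3": {"p": ["a", "b"], "q": ["c"]},
    "o4": {"m": ["a", "b"], "n": ["c"]},
    "o5": {"u": ["a"], "v": ["b", "c"]},  # no ground truth available
}
labels = {"o1": "x", "o2": "x", "o3": "p", "o4": "m"}

weights = fit(observations, labels, sources=["a", "b", "c"])
accuracy = {s: 1.0 / (1.0 + math.exp(-w)) for s, w in weights.items()}
fused_o5 = fuse(observations["o5"], weights)
print(accuracy, fused_o5)
```

In this toy setup, source `a` agrees with every label and `c` with none, so the learned weights let the single claim from `a` outweigh the two-source majority on the unlabeled object `o5`, which plain majority voting would get wrong if `a` is in fact the reliable source.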
