Dynamic Truth Discovery on Numerical Data

Truth discovery aims to obtain the most credible information from multiple sources that provide noisy and conflicting values. Because data conflicts are ubiquitous in practice, truth discovery has attracted considerable research attention in recent years. However, existing truth discovery models overlook an important issue: truth evolution. In many real-life scenarios, the latent true value changes over time rather than staying static. We study the dynamic truth discovery problem for numerical data. Existing models cannot address this problem because it poses two new challenges: capturing time-evolving source dependency in a continuous space, and handling missing data on the fly. We propose EvolvT, a model for dynamic truth discovery on numerical data. Built on the hidden Markov framework, EvolvT captures three key aspects of dynamic truth discovery in a unified model: truth transition regularity, source quality, and source dependency. Its most distinctive modeling feature is the use of Kalman filtering to model truth evolution. As a result, EvolvT can infer source dependency in a continuous space in a principled way and can handle missing data naturally. We develop an expectation-maximization (EM) algorithm for parameter inference in EvolvT and present an efficient online version of the inference procedure. Experiments on real-world applications demonstrate EvolvT's advantages over state-of-the-art truth discovery approaches.
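To illustrate why Kalman filtering accommodates missing data, the following is a minimal sketch of a scalar Kalman filter that fuses claims from several sources about a slowly changing true value. It is not the EvolvT model: it assumes a random-walk truth, known per-source noise variances, and independent sources (so it does not capture source dependency or learn parameters with EM). The function name `kalman_truth_filter` and all parameter names are hypothetical.

```python
def kalman_truth_filter(observations, source_var, process_var,
                        init_mean=0.0, init_var=1e6):
    """Filter a scalar random-walk truth from multi-source claims.

    observations: list of time steps; each step is a list with one claim
                  per source, or None where that source reported nothing.
    source_var:   assumed observation-noise variance of each source
                  (a smaller variance means a more reliable source).
    process_var:  assumed variance of the truth's step-to-step change.
    Returns a list of (mean, variance) estimates, one per time step.
    """
    mean, var = init_mean, init_var
    estimates = []
    for claims in observations:
        # Predict: a random walk keeps the mean and inflates the variance.
        var = var + process_var
        # Update: fold in each available claim; missing claims are skipped,
        # so a step with no claims reduces to the prediction alone.
        for value, r in zip(claims, source_var):
            if value is None:
                continue
            gain = var / (var + r)
            mean = mean + gain * (value - mean)
            var = (1.0 - gain) * var
        estimates.append((mean, var))
    return estimates


# Illustrative data (made up): three sources, some claims missing.
readings = [
    [20.1, 19.8, None],
    [None, None, None],   # no source reports at this step
    [20.9, None, 23.0],
]
estimates = kalman_truth_filter(readings, source_var=[0.5, 0.5, 4.0],
                                process_var=0.1)
```

Under these assumptions, each claim is weighted by the inverse of its source's noise variance, and a time step with no claims simply carries the previous estimate forward with increased uncertainty.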
