PatternFinder: Pattern discovery for truth discovery

Abstract

Truth discovery methods infer truths from the conflicting information provided by multiple sources. These methods usually resolve conflicts using only entity-level information. However, because data are often incomplete and entity matching is difficult, the information available for an individual entity is often insufficient. This motivates pattern discovery, which mines useful patterns across entities from a global perspective. In this paper, we introduce pattern discovery for truth discovery and formulate it as an optimization problem. To solve this problem, we propose an algorithm called PatternFinder that jointly and iteratively learns the variables. We also propose an optimized grouping strategy to improve its efficiency. Experimental results on simulated and real-world datasets show that the proposed methods outperform state-of-the-art baselines in both effectiveness and efficiency.
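To make the entity-level setting concrete, the following is a minimal sketch of a generic iterative truth discovery loop that alternates between weighted voting per entity and re-estimating source weights from error rates. It is not PatternFinder and does not use cross-entity patterns; the function name `iterative_truth_discovery`, the toy claims, the logarithmic weight update, and the smoothing constant are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def iterative_truth_discovery(claims, n_iters=10, eps=1e-6):
    """Generic entity-level truth discovery (illustrative sketch only).

    claims: dict mapping (source, entity) -> claimed categorical value.
    Returns (truths, weights): estimated value per entity, weight per source.
    """
    sources = sorted({s for s, _ in claims})
    entities = sorted({e for _, e in claims})
    weights = {s: 1.0 for s in sources}  # start with equal trust
    truths = {}

    for _ in range(n_iters):
        # Step 1: fix the weights, pick the value with the largest weighted vote.
        for e in entities:
            votes = defaultdict(float)
            for (s, ent), value in claims.items():
                if ent == e:
                    votes[value] += weights[s]
            # sorted() makes tie-breaking deterministic.
            truths[e] = max(sorted(votes), key=votes.get)

        # Step 2: fix the truths, re-estimate weights from smoothed error rates.
        errors = {}
        for s in sources:
            made = [(ent, v) for (src, ent), v in claims.items() if src == s]
            wrong = sum(1 for ent, v in made if truths[ent] != v)
            errors[s] = (wrong + eps) / len(made)
        total = sum(errors.values())
        # Assumed update rule: lower relative error gives a higher weight.
        weights = {s: -math.log(errors[s] / total) for s in sources}

    return truths, weights


# Hypothetical toy data: sources A and B agree, source C often disagrees.
claims = {
    ("A", "e1"): "x", ("A", "e2"): "y", ("A", "e3"): "z",
    ("B", "e1"): "x", ("B", "e2"): "y", ("B", "e3"): "z",
    ("C", "e1"): "q", ("C", "e2"): "y", ("C", "e3"): "w",
}
truths, weights = iterative_truth_discovery(claims)
```

Because each entity is resolved only from the claims made about that entity, a loop of this kind has little to work with when an entity is covered by few sources, which is the limitation the abstract cites as motivation for mining patterns across entities.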
