Graph repairing under neighborhood constraints

A broad class of data, ranging from similarity networks and workflow networks to protein networks, can be modeled as graphs with data values as vertex labels. Both vertex labels and neighbors can be dirty for various reasons, such as typos or erroneous reporting of results in scientific experiments. Neighborhood constraints, which specify the label pairs that are allowed to appear on adjacent vertices in the graph, are employed to detect and repair erroneous vertex labels and neighbors. In this paper, we study the problem of repairing vertex labels and neighbors so that a graph satisfies its neighborhood constraints. Unfortunately, the problem is hard in general, which motivates us to devise approximation methods for repairing and to identify interesting special cases (star and clique constraints) that can be solved efficiently. First, we propose several approximation algorithms for label repairing, including greedy heuristics, a contraction method, and an approach combining both. We also analyze the performance of these algorithms for the special case. Moreover, we devise a cubic-time, constant-factor graph repairing algorithm that performs both label and neighbor repairs (given degree-bounded instance graphs). Our extensive experimental evaluation on real data demonstrates the effectiveness of the proposed methods in eliminating fraud in several types of application networks.
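
To make the notion of a neighborhood constraint concrete, the following is a minimal Python sketch of violation detection only; it is not one of the repairing algorithms proposed in the paper. The function name `violations`, the dictionary and set representations, and the choice to treat constraints as undirected label pairs are all assumptions made for illustration.

```python
def violations(labels, edges, allowed):
    """Return the edges whose endpoint labels are not an allowed pair.

    labels  : dict mapping vertex -> label (hypothetical representation)
    edges   : iterable of (u, v) vertex pairs of the instance graph
    allowed : iterable of (label_a, label_b) pairs permitted on adjacent vertices
    """
    # Assumption: constraints are undirected, so (a, b) also permits (b, a).
    symmetric = {frozenset(pair) for pair in allowed}
    return [
        (u, v)
        for u, v in edges
        if frozenset((labels[u], labels[v])) not in symmetric
    ]


# Toy instance: labels 'a'-'b' and 'b'-'c' may be adjacent, 'a'-'c' may not.
labels = {1: "a", 2: "b", 3: "c", 4: "b"}
edges = [(1, 2), (2, 3), (3, 4), (1, 3)]
allowed = {("a", "b"), ("b", "c")}

dirty_edges = violations(labels, edges, allowed)
print(dirty_edges)  # [(1, 3)]
```

In this toy instance the edge (1, 3) joins labels 'a' and 'c', which is not an allowed pair. A repair would then have to change a label or a neighbor; for example, deleting the edge (1, 3) would remove the violation.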
