Efficient top-k count queries over imprecise duplicates

We propose efficient techniques for processing various top-K count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways. First, we deduplicate on the fly only the part of the data needed for the answer, a requirement in massive and evolving sources where batch deduplication is expensive. The non-local nature of the problem of partitioning data into duplicate groups makes it challenging to filter only those tuples forming the K largest groups. We propose a novel method of successively collapsing and pruning records that yields an order of magnitude reduction in running time compared to deduplicating the entire data first. Second, we return multiple high-scoring answers to handle situations where it is impossible to resolve whether two records are indeed duplicates of each other. Since finding even the highest-scoring deduplication is NP-hard, the existing approach is to deploy one of many variants of score-based clustering algorithms, which do not easily generalize to finding multiple groupings. We model deduplication as a segmentation of a linear embedding of records and present a polynomial-time algorithm for finding the R highest-scoring answers. This method closely matches the accuracy of an exact exponential-time algorithm on several datasets.
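
To make the segmentation view concrete, the following is a minimal sketch of an R-best dynamic program over records that are assumed to be already placed in a linear order. It is an illustration under stated assumptions, not the algorithm from the paper: the function name `top_r_segmentations`, the `pair_score` callback (positive for likely duplicates, negative otherwise), and the choice to score a group as the sum of its pairwise scores are all hypothetical. How the linear embedding itself is computed is not covered here.

```python
import heapq
from itertools import combinations


def top_r_segmentations(n, pair_score, r, max_len=None):
    """Return up to r highest-scoring segmentations of positions 0..n-1.

    A segmentation is a list of half-open spans (start, end); each span is
    treated as one candidate duplicate group. The score of a span is assumed
    to be the sum of pair_score(a, b) over all pairs inside it (hypothetical).
    """
    max_len = max_len or n

    def span_score(i, j):
        return sum(pair_score(a, b) for a, b in combinations(range(i, j), 2))

    # best[j] holds up to r entries (score, start_of_last_span, rank_in_prefix)
    # describing the best segmentations of the prefix 0..j-1.
    best = [[(0.0, None, None)]] + [[] for _ in range(n)]
    for j in range(1, n + 1):
        cands = []
        for i in range(max(0, j - max_len), j):
            s_ij = span_score(i, j)
            for rank, (s, _, _) in enumerate(best[i]):
                cands.append((s + s_ij, i, rank))
        best[j] = heapq.nlargest(r, cands, key=lambda c: c[0])

    # Follow back-pointers to recover each of the top segmentations.
    results = []
    for rank in range(len(best[n])):
        score = best[n][rank][0]
        spans, j, k = [], n, rank
        while j > 0:
            _, i, prev_rank = best[j][k]
            spans.append((i, j))
            j, k = i, prev_rank
        results.append((score, spans[::-1]))
    return results


# Toy input: records 0,1 look like duplicates, as do records 2,3.
DUPLICATE_PAIRS = {(0, 1), (2, 3)}


def toy_pair_score(a, b):
    return 1 if (a, b) in DUPLICATE_PAIRS else -1


if __name__ == "__main__":
    for score, spans in top_r_segmentations(4, toy_pair_score, r=3):
        print(score, spans)
```

Because each prefix keeps only its R best partial segmentations, the work is polynomial in the number of records and in R, whereas enumerating all groupings is exponential. The runner-up segmentations show which group boundaries are uncertain, which is the motivation for returning more than one answer.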
