Completeness is one of the central criteria for data quality. Data completeness refers to how completely the data describes the objective world, and it divides into value completeness and tuple completeness. This paper examines how to use multiple data sources to evaluate the record completeness of a target data set. An exact evaluation, however, requires access to all of the data sources in full, which incurs a huge cost and is unrealistic in practice. This paper therefore presents a signature-based randomized estimator for record completeness evaluation. The basic idea is to compute a signature for each data source in a single linear scan and then use these signatures to quickly estimate the record sets covered by the data sources and the target data set. The estimation time is independent of the size of each data set, avoiding the huge overhead of record-pair matching. Experimental results on real data demonstrate the effectiveness and efficiency of the algorithm.
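As a rough illustration of the idea, a minimal sketch of one way such a signature scheme could work, using k-minimum-values (KMV) sketches to estimate set cardinalities (the paper's actual signature construction and estimator may differ; all function names here are illustrative assumptions):

```python
import hashlib

K = 256  # signature size; larger k gives lower estimation variance

def signature(records, k=K):
    """KMV signature: the k smallest normalized hash values of the records."""
    hashes = sorted({int(hashlib.sha1(r.encode()).hexdigest(), 16) / 2**160
                     for r in records})
    return hashes[:k]

def merge(sig_a, sig_b, k=K):
    """Signature of the union of two sets, from their signatures alone."""
    return sorted(set(sig_a) | set(sig_b))[:k]

def estimate_distinct(sig, k=K):
    """KMV estimator for the number of distinct records."""
    if len(sig) < k:
        return len(sig)  # fewer than k records seen: count is exact
    return (k - 1) / sig[k - 1]

def estimate_completeness(target, sources, k=K):
    """Estimate |T| / |T ∪ S1 ∪ ... ∪ Sn| from one linear scan per set."""
    sig_t = signature(target, k)
    sig_u = sig_t
    for s in sources:
        sig_u = merge(sig_u, signature(s, k), k)
    return estimate_distinct(sig_t, k) / estimate_distinct(sig_u, k)
```

Each signature is built in one pass over its source, and merging signatures takes time depending only on k, so the final estimate is independent of the sizes of the individual data sets, matching the cost profile the abstract describes.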