Web-Age Information Management

Real-life data are often dirty: inconsistent, inaccurate, incomplete, stale and duplicated. Dirty data have been a longstanding issue, and the prevalent use of the Internet has been increasing the risks, on an unprecedented scale, of creating and propagating dirty data. Dirty data are reported to cost US industry billions of dollars each year. There is no reason to believe that the scale of the problem is any different in any other society that depends on information technology. With this comes the need for improving data quality, a topic as important as traditional data management tasks for coping with the quantity of the data.

We aim to provide an overview of recent advances in the area of data quality, from theory to practical techniques. We promote a conditional dependency theory for capturing data inconsistencies, a new form of dynamic constraints for data deduplication, a theory of relative information completeness for characterizing incomplete data, and a data currency model for answering queries with current values from possibly stale data in the absence of reliable timestamps. We also discuss techniques for automatically discovering data quality rules, for detecting errors in real-life data, and for correcting errors with performance guarantees.

1 Data Quality: An Overview

Traditional database systems typically focus on the quantity of data, to support the creation, maintenance and use of large volumes of data. But such a database system may not find correct answers to our queries if the data in the database are "dirty", i.e., when the data do not properly represent the real-world entities to which they refer.
To illustrate this, let us consider an employee relation residing in a database of a company, specified by the following schema:

employee (FN, LN, CC, AC, phn, street, city, zip, salary, status)

Here each tuple specifies an employee's name (first name FN and last name LN), office phone (country code CC, area code AC, phone phn), office address (street, city, zip code), salary and marital status. An instance D0 of the employee schema is shown in Figure 1.

Fan is supported in part by EPSRC EP/J015377/1, the RSE-NSFC Joint Project Scheme, the 973 Program 2012CB316200 and NSFC 61133002 of China.

H. Gao et al. (Eds.): WAIM 2012, LNCS 7418, pp. 1–16, 2012. © Springer-Verlag Berlin Heidelberg 2012
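As a rough illustration only, the employee schema described above could be written down as a Python record type. The attribute names follow the schema; the field types, the helper `missing_attributes`, and the sample tuple are hypothetical and are not taken from Figure 1 or from the paper.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Employee:
    """One tuple of the employee schema (field types are assumptions)."""
    FN: str       # first name
    LN: str       # last name
    CC: str       # country code of the office phone
    AC: str       # area code of the office phone
    phn: str      # office phone number
    street: str   # office address: street
    city: str     # office address: city
    zip: str      # office address: zip code
    salary: int   # salary (unit not specified in the text)
    status: str   # marital status


def missing_attributes(t: Employee) -> list:
    """Hypothetical helper: names of attributes left empty in a tuple."""
    return [f.name for f in fields(t) if getattr(t, f.name) in (None, "")]


# A made-up tuple for illustration; it is NOT a tuple of the instance D0.
t = Employee(FN="Ann", LN="Example", CC="44", AC="131", phn="",
             street="1 Sample Road", city="EDI", zip="EH0 0XX",
             salary=60000, status="single")

print(missing_attributes(t))
```

A check like this only flags empty values; it says nothing about whether the values that are present are consistent, accurate or current, which is where the dependency-based techniques outlined in the abstract come in.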
