Data Quality Problems beyond Consistency and Deduplication

Recent work on data quality has focused primarily on data repairing algorithms for improving data consistency and on record matching methods for data deduplication. This paper highlights several other challenging issues that are essential to developing data cleaning systems: error correction with performance guarantees, unification of data repairing and record matching, relative information completeness, and data currency. We give an overview of recent advances in the study of these issues and advocate developing a logical framework that treats them uniformly.
