A Taxonomy of Dirty Data

Today, large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining, and customer relationship management systems. A major problem that is only beginning to be recognized is that the data in these sources are often “dirty”. Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. The results of analyzing a database or data warehouse that contains dirty data are unreliable at best and can be damaging. In this paper, a comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, how they manifest themselves, and how they may be cleansed to ensure proper construction of data warehouses and accurate data analysis. The impact of dirty data on data mining is also explored.
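
The three broad categories named in the abstract (missing data, wrong data, and non-standard representations of the same data) can be illustrated with a minimal sketch. This is not the paper's classification, only an illustration under stated assumptions: the field names, the plausible age range of 0 to 130, and the `CANONICAL_STATES` lookup table are all hypothetical.

```python
from typing import Optional

# Hypothetical lookup from accepted spellings to one canonical form.
# The entries are assumptions made for this illustration only.
CANONICAL_STATES = {
    "NY": "NY",
    "N.Y.": "NY",
    "New York": "NY",
    "TX": "TX",
    "Texas": "TX",
}


def classify_age(value: Optional[int]) -> str:
    """Label an age value as 'missing', 'wrong', or 'clean'."""
    if value is None:
        return "missing"
    # Assumed plausibility range; a real system would take this from its schema.
    if value < 0 or value > 130:
        return "wrong"
    return "clean"


def classify_state(value: Optional[str]) -> str:
    """Label a state value as 'missing', 'wrong', 'non-standard', or 'clean'."""
    if value is None or value.strip() == "":
        return "missing"
    text = value.strip()
    canonical = CANONICAL_STATES.get(text)
    if canonical is None:
        # Not a known spelling of any state in the lookup table.
        return "wrong"
    if canonical != text:
        # A recognizable value, but not written in the canonical form.
        return "non-standard"
    return "clean"
```

Under these assumptions, `classify_state("N.Y.")` is labeled non-standard because it maps to the canonical `"NY"` without being equal to it, while an unknown value such as `"Atlantis"` is labeled wrong.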
