Paper Information - Automated Identification of Errors in Data Sets

Automated Identification of Errors in Data Sets

The paper presents an overview of current research and methods applied to the problem of data cleansing, together with a tool for the automated cleansing of data sets. The tool is designed to be domain independent and constitutes the first part of a proposed framework for automated data cleansing; developing a tool that addresses the entire framework is the ultimate goal of the research. The paper approaches the data cleansing problem from a different perspective, namely the identification of errors in single data sets, and in doing so ties together data quality and data mining issues. Existing outlier detection methods are used: clustering, pattern identification, and statistical methods. Experiments are run on real-world data, and the methods and results are presented and analyzed. Refinements of these methods, as well as new methods, are proposed to address the data cleansing problem.

This research is supported in part by a grant from the Office of Naval Research. Maletic & Marcus, Technical Report CS-00-02, 2/2/2000.

Andrian Marcus | Jonathan I. Maletic

[1] Thomas Redman, et al. Data quality for the information age, 1996.

[2] Richard Y. Wang, et al. An Object-Oriented Implementation of Quality Data Products, 1993.

[3] Anany Levitin, et al. A model of the data (life) cycles with application to quality, 1993, Inf. Softw. Technol.

[4] Diane M. Strong, et al. Data quality in context, 1997, CACM.

[5] Raymond T. Ng, et al. A Unified Notion of Outliers: Properties and Computation, 1997, KDD.

[6] Veda C. Storey, et al. A Framework for Analysis of Data Quality Research, 1995, IEEE Trans. Knowl. Data Eng.

[7] Werner Krischer, et al. The Data Analysis BriefBook, 1998.

[8] Dennis Shasha, et al. An Extensible Framework for Data Cleaning, 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[9] M. I. Svanks. Integrity analysis: methods for automating data quality assurance, 1988.

[10] Yiming Yang, et al. Learning approaches for detecting and tracking news events, 1999, IEEE Intell. Syst.

[11] Evangelos Simoudis, et al. Using Recon for Data Cleaning, 1995, KDD.

[12] Fionn Murtagh, et al. A Survey of Recent Advances in Hierarchical Clustering Algorithms, 1983, Comput. J.

[13] Diane M. Strong, et al. Beyond Accuracy: What Data Quality Means to Data Consumers, 1996, J. Manag. Inf. Syst.

[14] Anany Levitin, et al. The Notion of Data and Its Quality Dimensions, 1994, Inf. Process. Manag.

[15] Ken Orr, et al. Data quality and systems theory, 1998, CACM.

[16] Padhraic Smyth, et al. From Data Mining to Knowledge Discovery: An Overview, 1996, Advances in Knowledge Discovery and Data Mining.

[17] N. Mati, et al. Discovering Informative Patterns and Data Cleaning, 1996.

[18] Richard W. Hamming, et al. Coding and Information Theory, 2018, Feynman Lectures on Computation.

[19] Giri Kumar Tayi, et al. Methodology for allocating resources for data quality enhancement, 1989, Commun. ACM.

[20] Vic Barnett, et al. Outliers in Statistical Data, 1980.

[21] Ralph Kimball, et al. Dealing with dirty data, 1996.

[22] Thomas Redman, et al. The impact of poor data quality on the typical enterprise, 1998, CACM.

[23] Ronald J. Brachman, et al. The Process of Knowledge Discovery in Databases, 1996, Advances in Knowledge Discovery and Data Mining.