Automated Identification of Errors in Data Sets

This paper presents an overview of current research and methods applied to the problem of data cleansing, and it presents a tool for automated cleansing of data sets. The tool is designed to be domain independent and constitutes the first part of a proposed framework for automated data cleansing; developing a tool that addresses the entire framework is the ultimate goal of the research. The paper approaches the data cleansing problem from a different perspective, namely the identification of errors in single data sets. This research ties together data quality and data mining issues. Existing outlier detection methods are used: clustering, pattern identification, and statistical methods. Experiments are run on real-world data, and the methods and results are presented and analyzed. Refinements of these methods, as well as new methods, are proposed to address the data cleansing problem.

∗ This research is supported in part by a grant from the Office of Naval Research.

Maletic & Marcus, Technical Report CS-00-02, 2/2/2000.
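As an illustration of the statistical family of outlier detection methods named in the abstract, the following is a minimal sketch and not the authors' tool. It assumes a single numeric field and flags values that lie more than k standard deviations from the field's mean; the function name `flag_outliers` and the parameter `k` are hypothetical and chosen only for this example.

```python
from statistics import mean, pstdev


def flag_outliers(values, k=3.0):
    """Return indices of values more than k standard deviations from the mean.

    Illustrative assumption: `values` is one numeric field of a data set.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the field
    if sigma == 0:
        # All values are identical, so nothing can be flagged.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


if __name__ == "__main__":
    field = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]
    # A smaller k is used here because a single extreme value in a short
    # field inflates the standard deviation and can mask itself at k=3.
    print(flag_outliers(field, k=2.0))
```

A flagged index only marks a candidate error; whether the value is actually wrong still has to be decided using knowledge of the data.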
