Exploiting data preparation to enhance mining and knowledge discovery

One of the major obstacles to using organizational data for mining and knowledge discovery is that, in most cases, the data is not amenable to mining in its natural form. Using a data set from a large tertiary-care hospital, we provide strong empirical evidence that enhancing the data by introducing new attributes, along with judicious aggregation of existing attributes, results in higher-quality knowledge discovery. Interestingly, we also found that data set enhancements affect the performance of different data mining algorithms differently. We define and use several measures, including entropy, rule complexity and resonance, to evaluate the quality and usefulness of the knowledge discovered.
