The Effects and Interactions of Data Quality and Problem Complexity on Classification

Data quality remains a persistent problem in practice and a challenge for research. In this study we focus on the four dimensions of data quality noted as the most important to information consumers, namely accuracy, completeness, consistency, and timeliness. These dimensions are of particular concern for operational systems, and most importantly for data warehouses, which are often used as the primary data source for analyses such as classification, a general type of data mining. However, the definitions and conceptual models of these dimensions have not been collectively considered with respect to data mining in general or classification in particular. Nor have they been considered for problem complexity. Conversely, these four dimensions of data quality have only been indirectly addressed by data mining research. Using definitions and constructs of data quality dimensions, our research evaluates the effects of both data quality and problem complexity on classification using generated data and tests the results in a real-world case. Six different classification outcomes selected from the spectrum of classification algorithms show that data quality and problem complexity have significant main and interaction effects. From the findings of significant effects, the economics of higher data quality are evaluated for a frequent application of classification and illustrated by the real-world case.
