Managing Data Quality of the Data Warehouse: A Chance-Constrained Programming Approach

To make informed decisions, managers establish data warehouses that integrate multiple data sources. However, decisions based on the data warehouse are not always satisfactory because of low data quality. Although many studies have focused on data quality management, little effort has been made to explore effective data quality control strategies for the data warehouse. In this study, we propose a chance-constrained programming model that determines the optimal strategy for allocating control resources to mitigate the data quality problems of the data warehouse. We develop a modified Artificial Bee Colony algorithm to solve the model. Our work contributes to the literature on evaluating how data quality problems propagate through the data integration process and on controlling data quality at the data sources that make up the data warehouse. We use a data warehouse in a healthcare organization to illustrate the model and the effectiveness of the algorithm.
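As a rough illustration of what a chance constraint means in this setting, the sketch below checks whether a candidate allocation of control effort keeps residual data defects under a limit in at least a required share of sampled scenarios. It is not the model from the paper: the function names, the linear effect of control effort on defect rates, and the scenario values are all assumptions made for the example.

```python
from typing import Sequence


def satisfaction_rate(
    allocation: Sequence[float],
    scenarios: Sequence[Sequence[float]],
    max_defect: float,
) -> float:
    """Share of scenarios in which total residual defects stay within max_defect.

    allocation[i] is the control effort (0..1) spent on data source i.
    Each scenario lists one sampled defect rate per data source.
    """
    satisfied = 0
    for defect_rates in scenarios:
        # Assumed effect model (hypothetical): effort e removes a fraction e
        # of the defects coming from that source.
        residual = sum(
            rate * (1.0 - effort)
            for rate, effort in zip(defect_rates, allocation)
        )
        if residual <= max_defect:
            satisfied += 1
    return satisfied / len(scenarios)


def is_feasible(
    allocation: Sequence[float],
    scenarios: Sequence[Sequence[float]],
    max_defect: float,
    confidence: float,
) -> bool:
    """Chance constraint: the defect limit must hold with probability >= confidence."""
    return satisfaction_rate(allocation, scenarios, max_defect) >= confidence


# Illustrative scenarios for two data sources (made-up values).
scenarios = [(0.10, 0.20), (0.30, 0.10), (0.20, 0.40), (0.05, 0.05)]
allocation = (0.5, 0.5)

rate = satisfaction_rate(allocation, scenarios, max_defect=0.25)
print(rate)
print(is_feasible(allocation, scenarios, max_defect=0.25, confidence=0.7))
print(is_feasible(allocation, scenarios, max_defect=0.25, confidence=0.9))
```

An optimizer such as the modified Artificial Bee Colony algorithm mentioned above would search over candidate allocations, using a check of this kind to reject candidates that violate the required confidence level.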
