Towards a Data Quality Framework for Heterogeneous Data

Every industry produces significant data output as a product of its working processes, and with the recent advent of big data mining and integrated data warehousing there is a case for a robust methodology for assessing data quality to support sustainable and consistent processing. In this paper, we review Data Quality (DQ) across multiple domains in order to propose connections between their methodologies. This critical review suggests that, in the DQ assessment of heterogeneous data sets, such data sets are seldom treated as separate types of data in need of an alternative DQ assessment framework. We discuss the need for such a directed DQ framework and the opportunities foreseen in this research area, and propose to address it through degrees of heterogeneity.
