QuaIIe: A Data Quality Assessment Tool for Integrated Information Systems

Data is central to decision-making in enterprises and organizations (e.g., smart factories and predictive maintenance), as well as in private life (e.g., booking platforms). Especially in artificial intelligence applications such as self-driving cars, trust in data-driven decisions depends directly on the quality of the underlying data. It is therefore essential to know the quality of the data in order to assess the trustworthiness of the derived decisions and to reduce their uncertainty. In this paper, we present QuaIIe (Quality Assessment for Integrated Information Environments, pronounced [ˈkvɑlə]), a Java-based tool for the domain-independent, ad-hoc measurement of an information system's quality. QuaIIe takes a holistic approach that measures both schema and data quality, and it covers the dimensions accuracy, correctness, completeness, pertinence, minimality, and normalization. The quality measurements are presented as machine- and human-readable reports, which can be generated periodically to observe how data quality evolves. In contrast to most existing data quality tools, QuaIIe does not necessarily require domain knowledge and thus offers an initial ad-hoc estimation of an information system's quality.
