Towards Configurable Composite Data Quality Assessment

The growing availability of data over the last decades has given rise to a number of successful technologies, ranging from data collection and storage infrastructures to hardware and software tools for the efficient computation of analytics. This context, in principle, places a great demand on data quality. In practice, however, experience has shown that the open Web and other platforms hosting user-generated content or real-time data can provide little quality control at content production time. To address this challenge, we aim to provide a general and configurable model for assessing data quality that supports task composition. In particular, we introduce a model built around the notion of matching and illustrate the issues this approach can address with a concrete case study. We also identify and discuss challenges to be addressed in future research to strengthen this idea.
