On the Meaningfulness of “Big Data Quality” (Invited Paper)

In this paper, we discuss the application of the concept of data quality to big data, highlighting how complex it is to define in a general way. Data quality is already a multidimensional concept that is difficult to characterize in precise definitions, even for well-structured data. Big data add two further dimensions of complexity: (i) they are highly source specific, for which we adopt the UNECE classification, and (ii) they are highly unstructured and schema-less, often with no gold standard to refer to, or with one that is very difficult to access. After providing a tutorial on data quality in traditional contexts, we analyze big data through the UNECE classification. Then, for each type of data source, we choose a specific instance of that type (notably deep Web data, sensor-generated data, and Twitter posts/short texts) and discuss how quality dimensions can be defined in these cases. The overall aim of the paper is therefore to identify further research directions in the area of big data quality, while also providing an up-to-date account of the state of the art in data quality.
