Assessing the Quality of Unstructured Data: An Initial Overview

In contrast to structured data, unstructured data such as text, speech, video and images do not come with a data model that enables a computer to use them directly. Today, computers can interpret the knowledge encoded in unstructured data using methods from text analytics, image recognition and speech recognition, and unstructured data are therefore used increasingly in decision-making processes. Although decisions are commonly based on unstructured data, data quality assessment methods for such data are lacking. We consider data analysis pipelines built upon two types of data consumers: human consumers, who usually come at the end of the pipeline, and non-human (machine) consumers (e.g., natural language processing modules such as part-of-speech taggers and named entity recognizers), which mainly operate at intermediate stages. We define the data quality of unstructured data via (1) the similarity of the input data to the data expected by these consumers and (2) the similarity of the input data to the data representing the real world. We derive data quality dimensions from the elements of analytic pipelines for unstructured data and characterize them. Finally, we propose automatically measurable indicators for assessing the quality of unstructured text data and give hints towards an implementation.
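
As an illustration of the first part of this definition (similarity of the input data to the data a consumer expects), the following is a minimal sketch for text data. It is not the indicator set proposed in the paper: the function names, the regex tokenizer, and the choice of cosine similarity over term counts and a known-token ratio are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for a real tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(counts_a, counts_b):
    # Cosine similarity between two term-count vectors stored as Counters.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expected_data_similarity(input_text, expected_texts):
    # Assumed indicator: how close the input's term distribution is to the
    # texts the consumer expects (e.g., the training data of an NLP module).
    input_counts = Counter(tokenize(input_text))
    expected_counts = Counter()
    for text in expected_texts:
        expected_counts.update(tokenize(text))
    return cosine_similarity(input_counts, expected_counts)


def known_token_ratio(input_text, expected_texts):
    # Assumed indicator: share of input tokens that occur in the expected data.
    vocabulary = set()
    for text in expected_texts:
        vocabulary.update(tokenize(text))
    tokens = tokenize(input_text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocabulary) / len(tokens)


if __name__ == "__main__":
    expected = ["The patient reports mild headache.", "Patient denies fever."]
    noisy = "pt reprts hedache no fevr"
    print(expected_data_similarity(noisy, expected))
    print(known_token_ratio(noisy, expected))
```

Both values lie between 0 and 1; lower values suggest that the input deviates from what the consumer was built for, for instance because of misspellings or a different domain. A real implementation would likely use the consumer's own tokenizer and a more robust similarity measure.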
