Paper Information - Assessing the Quality of Unstructured Data: An Initial Overview

In contrast to structured data, unstructured data such as text, speech, video and images do not come with a data model that enables a computer to use them directly. Today, computers can interpret the knowledge encoded in unstructured data using methods from text analytics, image recognition and speech recognition, and unstructured data are therefore increasingly used in decision-making processes. Although decisions are commonly based on unstructured data, data quality assessment methods for such data are lacking. We consider data analysis pipelines built upon two types of data consumers: human consumers, who usually come at the end of the pipeline, and non-human (machine) consumers, such as natural language processing modules (e.g., part-of-speech taggers and named entity recognizers), which mainly operate at intermediate stages. We define the data quality of unstructured data via (1) the similarity of the input data to the data expected by these consumers and (2) the similarity of the input data to the data representing the real world. We deduce data quality dimensions from the elements of analytic pipelines for unstructured data and characterize them. Finally, we propose automatically measurable indicators for assessing the quality of unstructured text data and give hints towards an implementation.
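
To make the first part of the definition more concrete, the following is a minimal sketch of how the similarity between input text and the text a consumer expects might be measured automatically. It is an illustration under assumptions, not the indicators proposed in the paper: the function names (`tokenize`, `cosine_similarity`, `unknown_token_ratio`), the bag-of-words representation, and the choice of cosine similarity are all hypothetical stand-ins.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(input_text, expected_text):
    """Cosine similarity of the two texts' term-frequency vectors.

    Assumption: a bag-of-words view is a good enough proxy for
    "similarity to the data expected by the consumer".
    """
    a = Counter(tokenize(input_text))
    b = Counter(tokenize(expected_text))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    # Two texts with no tokens at all are treated as dissimilar.
    return dot / norm if norm else 0.0


def unknown_token_ratio(input_text, expected_vocabulary):
    """Share of input tokens that the consumer's vocabulary does not contain."""
    tokens = tokenize(input_text)
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in expected_vocabulary)
    return unknown / len(tokens)
```

In such a sketch, `expected_text` could be a sample of the data a machine consumer was trained on, and `expected_vocabulary` the set of words it knows; a low similarity or a high unknown-token ratio would then hint at input that the consumer is likely to handle poorly.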

Cornelia Kiefer
