Paper Information - Characteristics of Open Data CSV Files

Characteristics of Open Data CSV Files

This work analyzes an Open Data corpus containing 200K tabular resources with a total file size of 413 GB from a data consumer perspective. Our study shows that ~10% of the resources in Open Data portals are labelled as tabular data, of which only 50% can be considered CSV files. The study inspects the general shape of these tabular data, reports on the column and row distributions, and analyzes the availability of (multiple) header rows and whether a file contains multiple tables. In addition, we inspect and analyze the table column types, detect missing values, and report on the distribution of the values.
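The kind of per-file profiling described above (table shape, column types, missing values) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `cell_type` and `profile_csv`, the `MISSING_TOKENS` list, and the rule that the first row is the header are all assumptions made for illustration, and the type inference is deliberately naive.

```python
import csv
import io
from collections import Counter

# Hypothetical tokens treated as missing values; the study's own list is not given here.
MISSING_TOKENS = {"", "na", "n/a", "null", "-"}


def cell_type(value):
    """Classify one cell as 'missing', 'int', 'float', or 'string'."""
    text = value.strip()
    if text.lower() in MISSING_TOKENS:
        return "missing"
    for name, parse in (("int", int), ("float", float)):
        try:
            parse(text)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text, delimiter=","):
    """Report row/column counts, a dominant type per column, and missing-value counts.

    Assumes (naively) that the first row is a single header row.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return {"rows": 0, "columns": 0, "column_types": {}, "missing": {}}
    header, body = rows[0], rows[1:]
    width = max(len(row) for row in rows)
    # Pad the header with placeholder names if some data rows are wider than it.
    names = [header[i] if i < len(header) else f"col{i}" for i in range(width)]
    column_types, missing = {}, {}
    for i, name in enumerate(names):
        # Cells absent from short rows are counted as missing.
        counts = Counter(
            cell_type(row[i]) if i < len(row) else "missing" for row in body
        )
        missing[name] = counts.pop("missing", 0)
        column_types[name] = counts.most_common(1)[0][0] if counts else "missing"
    return {
        "rows": len(body),
        "columns": width,
        "column_types": column_types,
        "missing": missing,
    }


sample = "id,price,city\n1,2.5,Vienna\n2,,Graz\n3,n/a,Linz\n"
profile = profile_csv(sample)
print(profile)
```

A real corpus would also need delimiter and encoding detection, handling of multiple header rows, and detection of several tables in one file, none of which this sketch attempts.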
