Paper Information - Dataset Generation Patterns for Evaluating Knowledge Graph Construction


Confidentiality hinders the publication of authentic, labeled datasets of personal and enterprise data, although such datasets would be useful for evaluating knowledge graph construction approaches in industrial scenarios. We therefore plan to generate such data synthetically so that it appears as authentic as possible. We assume that knowledge workers follow certain habits when they produce or manage data; from these habits, generation patterns can be discovered and used by data generators to imitate real datasets. In this paper, we derive an initial set of 11 distinct patterns found in real industrial spreadsheets and demonstrate a suitable generator, Data Sprout, that is able to reproduce them. We describe how the generator produces spreadsheets in general and what altering effects the implemented patterns have.
