Paper Information - Schema Management for Document Stores

Schema Management for Document Stores

Document stores that provide the efficiency of a schema-less interface are widely used by developers of mobile and cloud applications. However, the simplicity that developers gain comes at the cost of more complex data management, because no schema is available. In this paper, we present a schema management framework for document stores. The framework discovers the schemas of JSON records and persists them in a repository, and it also supports queries and schema summarization. The major technical challenge comes from the varied structures of records, which are caused by the schema-less data model and by schema evolution. In the discovery phase, we apply a canonical-form-based method and propose an algorithm based on equivalent sub-trees to group equivalent schemas efficiently. Together with the algorithm, we propose a new data structure, eSiBu-Tree, to store schemas and support queries. To present a single summarized representation of the heterogeneous schemas in records, we introduce the concept of a "skeleton" and propose to use it as a relaxed form of the schema that captures a small set of core attributes. Finally, extensive experiments on real data sets demonstrate the efficiency of the proposed schema discovery algorithms, and practical use cases in real-world data exploration and integration scenarios illustrate the effectiveness of skeletons in these applications.
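To make the discovery idea concrete, the following is a minimal sketch of grouping JSON records by an order-independent canonical form of their structure. It is not the paper's algorithm: the abstract does not specify the canonical form, the equivalent sub-tree algorithm, or the eSiBu-Tree, so none of them is reproduced here. The function names `canonical_schema`, `group_by_schema`, and `frequent_attributes` are hypothetical, and the frequency threshold in `frequent_attributes` is only an assumed stand-in for the "skeleton" concept, whose actual definition is given in the paper.

```python
from collections import Counter, defaultdict


def canonical_schema(value):
    """Return a hashable, order-independent description of a JSON value's structure.

    Assumption: two records have equivalent schemas when they have the same
    attribute names with the same (recursively defined) types.
    """
    if isinstance(value, dict):
        # Sort attributes by name so that key order does not matter.
        items = sorted(value.items(), key=lambda kv: kv[0])
        return ("object", tuple((k, canonical_schema(v)) for k, v in items))
    if isinstance(value, list):
        # Describe an array by the set of distinct element schemas it contains.
        element_schemas = {canonical_schema(v) for v in value}
        return ("array", tuple(sorted(element_schemas, key=repr)))
    if isinstance(value, bool):  # must be tested before int: bool is a subclass of int
        return ("boolean",)
    if isinstance(value, (int, float)):
        return ("number",)
    if isinstance(value, str):
        return ("string",)
    if value is None:
        return ("null",)
    raise TypeError(f"not a JSON value: {type(value).__name__}")


def group_by_schema(records):
    """Map each distinct canonical schema to the indices of the records that have it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[canonical_schema(record)].append(index)
    return dict(groups)


def frequent_attributes(records, min_support=0.5):
    """Return the top-level attribute names present in at least `min_support` of the records.

    Assumed simplification: a plain frequency threshold, used here only to
    illustrate the idea of a small set of core attributes.
    """
    counts = Counter(name for record in records for name in record)
    threshold = min_support * len(records)
    return sorted(name for name, count in counts.items() if count >= threshold)


if __name__ == "__main__":
    records = [
        {"id": 1, "name": "a"},
        {"name": "b", "id": 2},               # same schema, different key order
        {"id": 3, "name": "c", "tags": ["x"]},
    ]
    for schema, indices in group_by_schema(records).items():
        print(indices, schema)
    print(frequent_attributes(records, min_support=0.6))
```

In this sketch the first two records fall into the same group because sorting attribute names removes the effect of key order, while the third record forms its own group because of the extra `tags` attribute. Hashing a canonical form is a simple way to detect equivalence, but it recomputes the full structure for every record and does not share work between records with common sub-structures.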
