Schema Management for Document Stores

Document stores that provide the efficiency of a schema-less interface are widely used by developers in mobile and cloud applications. However, the simplicity that developers gain paradoxically makes data management more complex, because no schema is available. In this paper, we present a schema management framework for document stores. The framework discovers the schemas of JSON records, persists them in a repository, and supports queries and schema summarization. The major technical challenge comes from the varied structures of records, which are caused by the schema-less data model and by schema evolution. In the discovery phase, we apply a canonical-form-based method and propose an algorithm based on equivalent sub-trees to group equivalent schemas efficiently. Together with the algorithm, we propose a new data structure, the eSiBu-Tree, to store schemas and support queries. To present a single summarized representation of the heterogeneous schemas in records, we introduce the concept of a "skeleton" and propose to use it as a relaxed form of the schema that captures a small set of core attributes. Finally, extensive experiments on real data sets demonstrate the efficiency of the proposed schema discovery algorithms, and practical use cases in real-world data exploration and integration scenarios illustrate the effectiveness of skeletons in these applications.
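As a rough illustration of the canonical-form idea, the following Python sketch maps each JSON record to an order-independent structural description and groups records whose descriptions are equal. It is a simplified stand-in, not the paper's equivalent-sub-tree algorithm or the eSiBu-Tree. The function names (canonical_schema, group_by_schema, frequent_attributes) and the min_support parameter are hypothetical, and the frequency-based summary at the end is only a naive proxy for a skeleton, whose actual definition is not given in this abstract.

```python
from collections import Counter, defaultdict


def canonical_schema(value):
    """Return a hashable, order-independent description of a JSON value's structure."""
    if isinstance(value, dict):
        # Visit keys in sorted order so attribute order does not matter.
        return ("object", tuple((k, canonical_schema(value[k])) for k in sorted(value)))
    if isinstance(value, list):
        # Assumption: an array is described by the set of its distinct element schemas.
        elems = {canonical_schema(v) for v in value}
        return ("array", tuple(sorted(elems, key=repr)))
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return ("boolean",)
    if isinstance(value, (int, float)):
        return ("number",)
    if isinstance(value, str):
        return ("string",)
    return ("null",)


def group_by_schema(records):
    """Map each distinct canonical schema to the indices of the records that share it."""
    groups = defaultdict(list)
    for i, record in enumerate(records):
        groups[canonical_schema(record)].append(i)
    return dict(groups)


def attribute_paths(value, prefix=""):
    """Collect dotted paths of attributes in nested objects (arrays are not descended)."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(child, path)
    return paths


def frequent_attributes(records, min_support=0.8):
    """Naive proxy for a skeleton: attributes present in at least min_support of records."""
    if not records:
        return []
    counts = Counter(p for r in records for p in attribute_paths(r))
    threshold = min_support * len(records)
    return sorted(p for p, c in counts.items() if c >= threshold)


if __name__ == "__main__":
    records = [
        {"id": 1, "name": "a", "tags": ["x"]},
        {"name": "b", "id": 2, "tags": ["y", "z"]},
        {"id": 3, "name": "c", "email": "c@example.com"},
    ]
    print(group_by_schema(records).values())
    print(frequent_attributes(records, min_support=0.6))
```

In this sketch the first two records fall into the same group even though their attributes appear in a different order and their arrays differ in length, while the third record forms its own group. Using a hashable tuple as the canonical form keeps grouping to a single dictionary lookup per record.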
