Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores

Although most NoSQL data stores are schema-less, information on the structural properties of the persisted data is nevertheless essential during application development. Without it, accessing the data becomes impractical. In this paper, we introduce an algorithm for schema extraction that operates outside of the NoSQL data store. Our method is specifically targeted at semi-structured data persisted in NoSQL stores, e.g., in JSON format. Rather than designing the schema up front, extracting a schema in hindsight can be seen as a reverse-engineering step. Based on the extracted schema information, we propose a set of similarity measures that capture the degree of heterogeneity of JSON data and reveal structural outliers in the data. We evaluate our implementation on two real-life datasets: a database from the Wendelstein 7-X project and Web Performance Data.
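
To make the idea concrete, the following is a minimal sketch of schema extraction in hindsight. It is not the algorithm of the paper. It assumes that a "schema" can be approximated by counting how often each (property path, type) pair occurs across a collection of parsed JSON documents, and that a structural outlier is a document containing pairs that occur in only a small fraction of the collection. The function names `paths`, `extract_schema`, and `outliers` and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def paths(value, prefix=""):
    """Yield (path, type_name) pairs for every node of a parsed JSON value."""
    yield prefix or "$", type(value).__name__
    if isinstance(value, dict):
        for key, child in value.items():
            yield from paths(child, f"{prefix}/{key}")
    elif isinstance(value, list):
        for child in value:
            yield from paths(child, f"{prefix}/[]")


def extract_schema(docs):
    """Count in how many documents each (path, type) pair occurs."""
    counts = Counter()
    for doc in docs:
        # A set ensures each pair is counted at most once per document.
        counts.update(set(paths(doc)))
    return counts


def outliers(docs, schema, threshold=0.5):
    """Return (index, rare_pairs) for documents containing pairs whose
    relative frequency is below the (assumed) threshold."""
    total = len(docs)
    flagged = []
    for index, doc in enumerate(docs):
        rare = sorted(p for p in set(paths(doc)) if schema[p] / total < threshold)
        if rare:
            flagged.append((index, rare))
    return flagged
```

For example, given three documents of the shape `{"id": 1, "name": "a"}` and one of the shape `{"id": "4", "title": "d"}`, this sketch flags the fourth document, because both its string-typed `id` and its `title` property occur in only a quarter of the collection.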
