Schema Inference for Massive JSON Datasets

In recent years JSON has become a widely used format for representing massive data collections. JSON collections are usually schemaless. While this offers several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly identify structural properties that would help them formulate correct queries, and many schema-based optimizations are not possible. In this paper we address the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language that is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about the input data. We then present our main contribution: the design of a schema inference algorithm, its theoretical study, and its implementation on Spark, which enables reasonable schema inference times for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of the inferred schemas, and scalability.
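The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way such an inference could be organised: a map step that types each JSON value on its own, followed by a reduce step that fuses the resulting types pairwise. The type representation and the names `infer`, `fuse`, `show` and `infer_schema` are assumptions made for illustration and are not taken from the paper. The fusion is written to be order-independent, which is the property a distributed reduce (for example on Spark) would rely on.

```python
import json
from functools import reduce

# Hypothetical type representation (not the paper's type language):
#   "atoms":  set of basic type names seen at this position
#   "record": None, or {field: (type, optional)}
#   "array":  None, or the fused type of all elements


def empty_type():
    """Identity element for fuse: a type that describes no value."""
    return {"atoms": set(), "record": None, "array": None}


def infer(value):
    """Map step: infer a type for a single parsed JSON value."""
    t = empty_type()
    if value is None:
        t["atoms"] = {"Null"}
    elif isinstance(value, bool):  # bool must be tested before int
        t["atoms"] = {"Bool"}
    elif isinstance(value, (int, float)):
        t["atoms"] = {"Num"}
    elif isinstance(value, str):
        t["atoms"] = {"Str"}
    elif isinstance(value, list):
        t["array"] = reduce(fuse, (infer(v) for v in value), empty_type())
    elif isinstance(value, dict):
        t["record"] = {k: (infer(v), False) for k, v in value.items()}
    return t


def fuse_records(r1, r2):
    """Merge two record types; a field missing on one side becomes optional."""
    if r1 is None or r2 is None:
        return r1 if r2 is None else r2
    out = {}
    for k in r1.keys() | r2.keys():
        if k in r1 and k in r2:
            (t1, o1), (t2, o2) = r1[k], r2[k]
            out[k] = (fuse(t1, t2), o1 or o2)
        else:
            t, _ = r1[k] if k in r1 else r2[k]
            out[k] = (t, True)
    return out


def fuse_arrays(e1, e2):
    """Merge two array element types, treating None as 'no array seen'."""
    if e1 is None or e2 is None:
        return e1 if e2 is None else e2
    return fuse(e1, e2)


def fuse(a, b):
    """Reduce step: combine two types into one that covers both."""
    return {
        "atoms": a["atoms"] | b["atoms"],
        "record": fuse_records(a["record"], b["record"]),
        "array": fuse_arrays(a["array"], b["array"]),
    }


def show(t):
    """Render a type as a compact string, with '+' for unions and '?' for optional fields."""
    parts = sorted(t["atoms"])
    if t["record"] is not None:
        fields = ", ".join(
            f"{k}{'?' if opt else ''}: {show(ft)}"
            for k, (ft, opt) in sorted(t["record"].items(), key=lambda kv: kv[0])
        )
        parts.append("{" + fields + "}")
    if t["array"] is not None:
        parts.append("[" + show(t["array"]) + "]")
    return " + ".join(parts) if parts else "Empty"


def infer_schema(lines):
    """Infer one fused type for an iterable of JSON documents, one per string."""
    return reduce(fuse, (infer(json.loads(line)) for line in lines), empty_type())
```

For the two documents `{"id": 1, "tags": ["a", "b"]}` and `{"id": "x7", "name": null}`, `show(infer_schema(...))` yields `{id: Num + Str, name?: Null, tags?: [Str]}`: the irregular `id` field is kept as a union, and fields present in only one document are marked optional. Because `fuse` gives the same result regardless of grouping or order, the same two functions could in principle be passed to a distributed map and reduce; that mapping is an assumption of this sketch, not a description of the paper's implementation.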
