An Integration-Oriented Ontology to Govern Evolution in Big Data Ecosystems

Big Data architectures allow heterogeneous data from multiple sources to be flexibly stored and processed in their original format. The structure of those data, commonly supplied by means of REST APIs, is continuously evolving. Thus, data analysts need to adapt their analytical processes after each API release. This becomes more challenging when performing an integrated or historical analysis. To cope with such complexity, in this paper we present the Big Data Integration ontology, the core construct to govern the data integration process under schema evolution by systematically annotating it with information regarding the schema of the sources. We present a query rewriting algorithm that, using the annotated ontology, converts queries posed over the ontology into queries over the sources. To cope with syntactic evolution in the sources, we present an algorithm that semi-automatically adapts the ontology upon new releases. This guarantees that ontology-mediated queries correctly retrieve data from the most recent schema version and that historical queries remain correct. A functional and performance evaluation on real-world APIs is performed to validate our approach.
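To make the two algorithmic ideas above concrete, the following toy sketch shows one way a release-annotated mapping could drive both query rewriting and adaptation to a new release. It is an illustration under strong simplifying assumptions, not the algorithms of the paper: the class `ToyOntology`, its methods `add_release` and `rewrite`, and all feature, attribute and release names are hypothetical, queries are reduced to plain projections, and the ontology is reduced to a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOntology:
    """Hypothetical, heavily simplified stand-in for an annotated ontology.

    For every ontology-level feature it records which source attribute
    carries that feature in each known release (schema version) of an API.
    """
    # feature -> {release -> source attribute name}
    mappings: dict = field(default_factory=dict)
    # releases in chronological order
    releases: list = field(default_factory=list)

    def add_release(self, release, attributes, renames=None):
        """Adapt the mappings to a new release.

        Attributes that kept their name are carried over automatically.
        Renamed ones must be given in `renames` (feature -> new attribute),
        which models the manual part of a semi-automatic adaptation.
        Returns the features that could not be mapped in the new release.
        """
        renames = renames or {}
        previous = self.releases[-1] if self.releases else None
        unmapped = []
        for feature, per_release in self.mappings.items():
            old_attr = per_release.get(previous)
            if feature in renames and renames[feature] in attributes:
                per_release[release] = renames[feature]
            elif old_attr in attributes:
                per_release[release] = old_attr
            else:
                unmapped.append(feature)
        self.releases.append(release)
        return unmapped

    def rewrite(self, features, release=None):
        """Rewrite a projection over features into per-release projections.

        With release=None the query is treated as historical and is
        rewritten against every known release.
        """
        targets = [release] if release is not None else self.releases
        return {
            r: [self.mappings[f][r] for f in features if r in self.mappings[f]]
            for r in targets
        }


# Release v1 is annotated by hand; v2 renames one attribute.
onto = ToyOntology(
    mappings={"userName": {"v1": "user_name"}, "lagRatio": {"v1": "lag"}},
    releases=["v1"],
)
needs_review = onto.add_release(
    "v2", {"username", "lag"}, renames={"userName": "username"}
)
latest_plan = onto.rewrite(["userName", "lagRatio"], release="v2")
historical_plan = onto.rewrite(["userName", "lagRatio"])
```

In this sketch a query against the latest release touches only the `v2` attribute names, while a historical query yields one projection per release, so data produced under older schema versions stays reachable through the same ontology-level features.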
