Just-In-Time Data Virtualization: Lightweight Data Management with ViDa

As the size of data and its heterogeneity increase, traditional database system architecture becomes an obstacle to data analysis. Integrating and ingesting (loading) data into databases is quickly becoming a bottleneck in the face of massive data volumes and increasingly heterogeneous data formats. Still, state-of-the-art approaches typically rely on copying and transforming data into one (or a few) repositories. Queries, on the other hand, are often ad hoc and supported by pre-cooked operators that are not adaptive enough to optimize access to data. As data formats and queries increasingly vary, there is a need to depart from the status quo of static query processing primitives and build dynamic, fully adaptive architectures. We build ViDa, a system which reads data in its raw format and processes queries using adaptive, just-in-time operators. Our key insight is the use of virtualization, i.e., abstracting data and manipulating it regardless of its original format, combined with the dynamic generation of operators. ViDa's query engine is generated just-in-time; its caches and its query operators adapt to the current query and the workload, while also treating raw datasets as its native storage structures. Finally, ViDa features a language expressive enough to support heterogeneous data models, and to which existing languages can be translated. Users therefore have the power to choose the language best suited for an analysis.
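To make the idea of just-in-time operators over raw data more concrete, the following is a minimal sketch in Python and not ViDa's actual implementation. It assumes two toy raw formats (CSV and line-delimited JSON) and a hypothetical helper, `generate_scan`, which emits the source code of a scan operator specialized to one format, one projection, and one filter at query time, then compiles it. All names and sample data are illustrative assumptions.

```python
import csv
import io
import json


def generate_scan(fmt, columns, predicate_src):
    """Generate, at query time, a scan specialized to one format, projection, and filter.

    Hypothetical helper for illustration only. predicate_src is trusted
    Python source over the variable `rec`; never pass untrusted input to exec.
    """
    if fmt == "csv":
        reader = "csv.DictReader(io.StringIO(raw))"
    elif fmt == "jsonl":
        reader = "(json.loads(line) for line in raw.splitlines() if line.strip())"
    else:
        raise ValueError(f"unsupported format: {fmt}")

    # Only the requested columns appear in the generated code.
    projection = ", ".join(f"{c!r}: rec[{c!r}]" for c in columns)
    src = (
        "def scan(raw):\n"
        f"    for rec in {reader}:\n"
        f"        if {predicate_src}:\n"
        f"            yield {{{projection}}}\n"
    )
    namespace = {"csv": csv, "io": io, "json": json}
    exec(src, namespace)  # compile the specialized operator
    return namespace["scan"]


# Two raw datasets in different formats, queried in place without loading.
csv_raw = "name,age\nAda,36\nBob,25\n"
jsonl_raw = '{"name": "Eve", "age": 41}\n{"name": "Dan", "age": 19}\n'

# The same logical query yields a different generated operator per format.
predicate = "float(rec['age']) > 30"
scan_csv = generate_scan("csv", ["name"], predicate)
scan_json = generate_scan("jsonl", ["name"], predicate)

results = list(scan_csv(csv_raw)) + list(scan_json(jsonl_raw))
print(results)
```

The point of the sketch is that the query is expressed once against an abstract record, while the code that touches the bytes is produced per format and per query rather than chosen from a fixed set of pre-built operators.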
