Handling Big Data in Astronomy and Astrophysics: Rich Structured Queries on Replicated Cloud Data with XtreemFS

With recent observational instruments and survey campaigns in astrophysics, efficient analysis of big structured data is becoming increasingly relevant. While off-the-shelf RDBMS provide good query expressiveness and data analysis capabilities through SQL, they are not yet well prepared to handle high-volume scientific data distributed across several nodes, neither for fast data ingest nor for fast spatial queries. Our SQL query parser and job manager perform query reformulation to spread queries to the data nodes (shards), gathering outputs on a head node and providing them again to the shards for subsequent processing steps. We combine this data analysis architecture with the cloud data storage component XtreemFS for automatic data replication to improve availability and access latency. With our solution, we perform rich structured data analysis, expressed in SQL, on large amounts of structured astrophysical data distributed across numerous storage nodes in parallel. The cloud storage virtualization with XtreemFS provides elasticity and reproducibility of scientific analysis tasks through its snapshot capability.
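The scatter-gather pattern described above can be illustrated with a minimal sketch. It is not the authors' query parser or job manager: the in-memory SQLite databases standing in for shards, the `halos` table, and the functions `make_shard` and `distributed_avg_mass` are all hypothetical names chosen for illustration. The sketch assumes a single aggregate query and shows why reformulation is needed: an average cannot be merged from per-shard averages, so each shard is asked for a partial sum and count, and the head node combines them.

```python
import sqlite3


def make_shard(rows):
    """Create one in-memory SQLite database standing in for a data node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE halos (id INTEGER, mass REAL)")
    conn.executemany("INSERT INTO halos VALUES (?, ?)", rows)
    return conn


def distributed_avg_mass(shards, min_mass):
    """Answer 'SELECT AVG(mass) FROM halos WHERE mass >= ?' across shards."""
    # Reformulated per-shard query: AVG is replaced by SUM and COUNT,
    # which can be merged correctly on the head node.
    shard_sql = (
        "SELECT COALESCE(SUM(mass), 0.0), COUNT(*) "
        "FROM halos WHERE mass >= ?"
    )

    # The head node gathers the partial results into a temporary table.
    head = sqlite3.connect(":memory:")
    head.execute("CREATE TABLE partial (s REAL, n INTEGER)")
    for shard in shards:
        partial = shard.execute(shard_sql, (min_mass,)).fetchone()
        head.execute("INSERT INTO partial VALUES (?, ?)", partial)

    # Final aggregation step on the head node.
    total, count = head.execute(
        "SELECT SUM(s), SUM(n) FROM partial"
    ).fetchone()
    head.close()
    return total / count if count else None


if __name__ == "__main__":
    shards = [
        make_shard([(1, 1.0), (2, 3.0)]),
        make_shard([(3, 5.0)]),
    ]
    print(distributed_avg_mass(shards, 2.0))
```

In a multi-step plan, the gathered table on the head node would be sent back to the shards as input for the next step; this sketch stops after a single gather.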
