Column Store for GWAC: A High-cadence, High-density, Large-scale Astronomical Light Curve Pipeline and Distributed Shared-nothing Database

The ground-based wide-angle camera array (GWAC), part of the SVOM space mission, will search for various types of optical transients by continuously imaging a field of view (FOV) of 5000 square degrees every 15 s. Each exposure consists of 36 × 4k × 4k pixels, typically yielding 36 × ~175,600 extracted sources. For a modern time-domain astronomy project like GWAC, which produces massive amounts of data at a high cadence, it is challenging to search for short-timescale transients in both real-time and archived data, and to build long-term light curves for variable sources. Here, we develop a high-cadence, high-density light curve pipeline (HCHDLP) to process the GWAC data in real time, and we design a distributed shared-nothing database to manage the massive amount of archived data, which will be used to generate a source catalog with more than 100 billion records during 10 years of operation. First, we build HCHDLP on the column-store DBMS MonetDB, taking advantage of MonetDB's high performance in massive data processing. To meet the real-time requirement of HCHDLP, we optimize the pipeline's source association function, reducing both its time and space complexity from outside the database (SQL semantics) and inside it (the RANGE-JOIN implementation), and we optimize its strategy for building complex light curves. The optimized source association function is accelerated by three orders of magnitude. Second, we build a distributed database using a two-level time partitioning strategy based on the MERGE TABLE and REMOTE TABLE technology of MonetDB. Intensive tests validate that our database architecture achieves both linear scalability in response time and concurrent access by multiple users. In summary, our studies provide guidance for real-time data processing and massive-data management in GWAC.
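To make the source association step concrete, the sketch below shows one generic way to express it as a range join: catalog sources are sorted by declination, each new detection is restricted to a declination window of the match radius, and only the candidates in that window are checked with an exact angular separation. This is an illustration under stated assumptions, not the pipeline's actual code, which the abstract describes as SQL running on MonetDB's RANGE-JOIN. The function names `angular_sep_deg` and `associate`, the tuple layouts, and the match radius are all hypothetical.

```python
import bisect
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def associate(catalog, detections, radius_deg):
    """Match each detection to its nearest catalog source within radius_deg.

    catalog:    list of (source_id, ra_deg, dec_deg)  -- hypothetical layout
    detections: list of (ra_deg, dec_deg)             -- hypothetical layout
    Returns one entry per detection: a source_id, or None if nothing matches.
    """
    # Sort once by declination so each detection needs only a binary search.
    by_dec = sorted(catalog, key=lambda s: s[2])
    decs = [s[2] for s in by_dec]

    matches = []
    for ra, dec in detections:
        # Range predicate: the separation is never smaller than |delta dec|,
        # so every true match lies inside this declination window.
        lo = bisect.bisect_left(decs, dec - radius_deg)
        hi = bisect.bisect_right(decs, dec + radius_deg)

        best_id, best_sep = None, radius_deg
        for source_id, cat_ra, cat_dec in by_dec[lo:hi]:
            sep = angular_sep_deg(ra, dec, cat_ra, cat_dec)
            if sep <= best_sep:
                best_id, best_sep = source_id, sep
        matches.append(best_id)
    return matches


catalog = [(1, 10.0, 20.0), (2, 10.0, 20.01), (3, 200.0, -5.0)]
detections = [(10.0001, 20.0001), (200.0, -5.0005), (50.0, 50.0)]
result = associate(catalog, detections, radius_deg=0.001)
print(result)
```

The declination window plays the role of the range predicate: it discards most of the catalog before any trigonometry is evaluated, which is the general reason a range join is cheaper than comparing every detection with every catalog source.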
