Embracing Uncertainty in Large-Scale Computational Astrophysics.

A revolution is underway in astronomy, driven by massive astrophysical surveys that provide a panchromatic view of the night sky. The next generation of surveys, and the simulations used to calibrate them, can produce in two nights what the previous generation produced over many years. This enormous image acquisition capability allows a telescope to revisit areas of the sky with sufficient frequency to expose dynamic features and transient events, e.g., asteroids whose trajectories may intersect Earth. At least three such surveys are planned; their collective output must be integrated and calibrated against computational simulations, prior surveys, and each other. Relational databases have been shown to be effective for astronomy at yesterday’s scale, but new access to the temporal dimension and increased intercomparison of multiple sources introduce new kinds of uncertainty that must be modeled explicitly in the database. Conventional relational database management systems are not aware of this uncertainty, so random variables must be prematurely and artificially collapsed to single values before they can be manipulated. Previous results in probabilistic databases focus on discrete attribute values and are unproven at large scale. In this paper, we present concrete examples of probabilistic query processing from computational astrophysics and use them to motivate two new directions of research: continuous-valued attributes, and queries involving complex aggregates over such attributes.
