LEO-Py: Estimating likelihoods for correlated, censored, and uncertain data with given marginal distributions

Data with uncertain, missing, censored, and correlated values are commonplace in many research fields, including astronomy. Unfortunately, such data are often treated in an ad hoc way in the astronomical literature, potentially resulting in inconsistent parameter estimates. Furthermore, in a realistic setting, the variables of interest or their errors may have non-normal distributions, which complicates the modeling. I present a novel approach to compute the likelihood function for such data sets. This approach employs Gaussian copulas to decouple the correlation structure of the variables from their marginal distributions, resulting in a flexible method to compute likelihood functions of data in the presence of measurement uncertainty, censoring, and missing data. I demonstrate its use by determining the slope and intrinsic scatter of the star-forming sequence of nearby galaxies from observational data. The outlined algorithm is implemented as the flexible, easy-to-use, open-source Python package LEO-Py.
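To illustrate how a Gaussian copula separates the correlation structure from the marginal distributions, the following is a minimal sketch of the log-likelihood for fully observed data only. It is not the LEO-Py API: the function name `gaussian_copula_loglike` and its arguments are hypothetical, and the sketch assumes SciPy frozen distributions as marginals and a valid correlation matrix. It does not handle measurement uncertainty, censoring, or missing values.

```python
import numpy as np
from scipy import stats


def gaussian_copula_loglike(x, marginals, corr):
    """Per-observation log-likelihood under a Gaussian copula (hypothetical helper).

    x         : (n, d) array of fully observed data
    marginals : list of d frozen scipy.stats distributions (assumed known)
    corr      : (d, d) correlation matrix of the latent normal scores
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    d = len(marginals)
    # Map each variable to (0, 1) with its marginal CDF, then to a normal score.
    u = np.column_stack([m.cdf(x[:, j]) for j, m in enumerate(marginals)])
    z = stats.norm.ppf(u)
    # Copula log-density: joint normal log-density of the scores minus the
    # sum of their independent standard normal log-densities.
    joint = stats.multivariate_normal(mean=np.zeros(d), cov=corr).logpdf(z)
    log_copula = joint - stats.norm.logpdf(z).sum(axis=1)
    # Marginal log-densities carry all information about the one-dimensional shapes.
    log_marginals = sum(m.logpdf(x[:, j]) for j, m in enumerate(marginals))
    return log_copula + log_marginals


# Usage: two non-normal marginals coupled by a correlation of 0.5 (made-up values).
marginals = [stats.lognorm(s=0.5), stats.gamma(a=2.0)]
corr = np.array([[1.0, 0.5], [0.5, 1.0]])
data = np.array([[1.2, 0.8], [0.7, 2.5]])
print(gaussian_copula_loglike(data, marginals, corr))
```

With normal marginals this expression reduces to the ordinary multivariate normal log-density, which is a convenient sanity check; swapping in other marginals changes only the `marginals` list, not the correlation model.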
