Photo-z Estimation: An Example of Nonparametric Conditional Density Estimation under Selection Bias

Redshift is a key quantity for inferring cosmological model parameters. In photometric redshift estimation, cosmologists use the coarse data collected from the vast majority of galaxies to predict the redshift of individual galaxies. To properly quantify the uncertainty in the predictions, however, one needs to go beyond standard regression and instead estimate the full conditional density f(z|x) of a galaxy's redshift z given its photometric covariates x. The problem is further complicated by selection bias: usually only the rarest and brightest galaxies have known redshifts, and these galaxies have characteristics and measured covariates that do not necessarily match those of the more numerous and dimmer galaxies of unknown redshift. Unfortunately, there is little research on how best to estimate complex multivariate densities in such settings. Here we describe a general framework for properly constructing and assessing nonparametric conditional density estimators under selection bias, and for combining two or more estimators for optimal performance. We propose new, improved photo-z estimators and illustrate our methods on data from the Sloan Digital Sky Survey and on an application to galaxy-galaxy lensing. Although our main application is photo-z estimation, our methods are relevant to any high-dimensional regression setting with complicated asymmetric and multimodal distributions in the response variable.
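To make the idea of assessing a conditional density estimator under selection bias concrete, the following is a minimal sketch and not the authors' implementation. It assumes a covariate-shift setting (the labeled and unlabeled samples share f(z|x) but differ in the distribution of x), a one-dimensional covariate, a simple Gaussian-kernel estimator, and an importance-weighted squared-error loss. All function names, bandwidths, and the simulated data are hypothetical, and the importance weights beta(x) = f_U(x)/f_L(x) are taken as known only because the data are simulated.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def covariate_weights(x_train, x_query, hx):
    """Normalized kernel weight of each training point for each query x."""
    w = gaussian_kernel((x_query[:, None] - x_train[None, :]) / hx)
    return w / w.sum(axis=1, keepdims=True)

def cde_on_grid(x_train, z_train, x_query, z_grid, hx, hz):
    """Kernel estimate of f(z|x) on z_grid; shape (n_query, n_grid)."""
    kz = gaussian_kernel((z_grid[None, :] - z_train[:, None]) / hz) / hz
    return covariate_weights(x_train, x_query, hx) @ kz

def cde_at_points(x_train, z_train, x_query, z_query, hx, hz):
    """Kernel estimate of f(z_i|x_i) at paired points (x_i, z_i)."""
    kz = gaussian_kernel((z_query[:, None] - z_train[None, :]) / hz) / hz
    return np.sum(covariate_weights(x_train, x_query, hx) * kz, axis=1)

def weighted_cde_loss(x_train, z_train, x_val, z_val, beta_val,
                      x_unlab, z_grid, hx, hz):
    """Importance-weighted squared-error loss, up to a constant.

    First term: average of the integral of f_hat^2 over unlabeled covariates.
    Second term: average of beta * f_hat(z|x) over labeled validation pairs.
    """
    dz = z_grid[1] - z_grid[0]
    f_unlab = cde_on_grid(x_train, z_train, x_unlab, z_grid, hx, hz)
    term1 = np.mean(np.sum(f_unlab**2, axis=1) * dz)
    f_val = cde_at_points(x_train, z_train, x_val, z_val, hx, hz)
    term2 = np.mean(beta_val * f_val)
    return term1 - 2.0 * term2

# Simulated data (hypothetical): labeled x ~ N(0, 1), unlabeled x ~ N(0.5, 1),
# and the same relation z = x + noise in both populations.
rng = np.random.default_rng(0)
x_lab = rng.normal(0.0, 1.0, size=700)
z_lab = x_lab + 0.2 * rng.normal(size=700)
x_unlab = rng.normal(0.5, 1.0, size=200)

x_train, z_train = x_lab[:500], z_lab[:500]
x_val, z_val = x_lab[500:], z_lab[500:]

# True density ratio N(0.5, 1) / N(0, 1); known here only by construction.
beta_val = np.exp(0.5 * x_val - 0.125)

z_grid = np.linspace(-6.0, 6.0, 601)

# Compare a reasonable bandwidth pair against a heavily oversmoothed one.
loss_sharp = weighted_cde_loss(x_train, z_train, x_val, z_val, beta_val,
                               x_unlab, z_grid, hx=0.2, hz=0.1)
loss_flat = weighted_cde_loss(x_train, z_train, x_val, z_val, beta_val,
                              x_unlab, z_grid, hx=5.0, hz=5.0)
print(loss_sharp, loss_flat)
```

In a real photo-z problem the weights would have to be estimated from the labeled and unlabeled covariates rather than computed in closed form. The same loss could then be used to choose tuning parameters or to compare competing estimators on the population of interest.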
