An Evaluation of Validation Metrics for Probabilistic Model Outputs