What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability