On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers