Rejoinder: Hedging Predictions in Machine Learning

As we say in the article, the two most important properties expected from confidence predictors are validity (they must tell the truth) and efficiency (the truth must be as informative as possible). Conformal predictors are automatically valid, so there is little to discuss here. Achieving efficiency, however, has so far been largely an art, and Alexey Chervonenkis, Phil Long and Sally McClean comment on this aspect of conformal prediction.

Indeed, as Prof. Chervonenkis notes, the article does not contain any theoretical results about efficiency. Such a result appears as Theorem 3.1 in our book [A3]: we use a nonconformity measure based on the nearest neighbours procedure to obtain a conformal predictor whose efficiency asymptotically approaches that of the Bayes-optimal confidence predictor. (Remember that the Bayes-optimal confidence predictor is optimized under the true probability distribution, which is unknown to the Predictor.) This result applies only to classification, and it is asymptotic. Nevertheless, it is our only step towards a ‘more principled way of designing good measures of strangeness’, as Prof. McClean puts it. Her question suggests the desirability of such more principled ways; we agree and would very much welcome further results in this direction.

An important aspect of efficiency is conditionality, discussed at length in [A3] (see e.g. p. 11). It would be ideal if we were able to learn the conditional probability distribution for the next label. Unfortunately, this is impossible under the unconstrained assumption of randomness, even in the case of binary classification ([A3], Chapter 5). The definition of validity is given in terms of unconditional probability, and this appears unavoidable. However, Prof. Chervonenkis’s worry that for some objects the prediction interval might be too wide and for others too narrow has been addressed in [A3]. If our objects are of several different types, the version of conformal predictors that we call ‘attribute-conditional Mondrian conformal predictors’ in [A3] (Section 4.5) will make sure that we have separate validity for each type of object. For example, in medical applications with patients as objects, we can always ensure separate validity for men and women.
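
To make the attribute-conditional idea concrete, here is a minimal Python sketch of how such a predictor might be computed for classification. It is an illustration under stated assumptions, not the construction of [A3]: the nonconformity measure is a simple 1-nearest-neighbour distance ratio (not necessarily the measure behind Theorem 3.1), and the function names, the toy data and the category labels "m" and "f" are all hypothetical. The only Mondrian ingredient is that the p-value of a candidate label is computed by comparing the new example with the examples of its own category only.

```python
import math


def nn_nonconformity(examples, i):
    """Assumed 1-NN nonconformity score of example i (larger = stranger):
    distance to the nearest example with the same label divided by the
    distance to the nearest example with a different label."""
    x_i, y_i, _ = examples[i]
    same = other = math.inf
    for j, (x_j, y_j, _) in enumerate(examples):
        if j == i:
            continue
        d = math.dist(x_i, x_j)
        if y_j == y_i:
            same = min(same, d)
        else:
            other = min(other, d)
    if other == 0.0 or same == math.inf:
        # Identical object with another label, or no same-label neighbour.
        return math.inf
    return same / other


def mondrian_p_value(train, x_new, category_new, y_candidate):
    """p-value of y_candidate, using only examples in the new object's category."""
    bag = train + [(x_new, y_candidate, category_new)]
    # Scores are computed on the whole bag; only the comparison is per category.
    scores = [nn_nonconformity(bag, i) for i in range(len(bag))]
    alpha_new = scores[-1]
    same_cat = [s for s, (_, _, c) in zip(scores, bag) if c == category_new]
    return sum(s >= alpha_new for s in same_cat) / len(same_cat)


def predict_set(train, x_new, category_new, labels, epsilon):
    """Prediction set at significance level epsilon: labels with p-value > epsilon."""
    return {
        y for y in labels
        if mondrian_p_value(train, x_new, category_new, y) > epsilon
    }


# Hypothetical toy data: (object, label, category), with one-dimensional objects.
train = [
    ((0.0,), 0, "m"), ((0.2,), 0, "m"), ((1.0,), 1, "m"), ((1.2,), 1, "m"),
    ((0.1,), 0, "f"), ((0.3,), 0, "f"), ((1.1,), 1, "f"), ((1.3,), 1, "f"),
]

print(predict_set(train, (0.15,), "m", [0, 1], epsilon=0.25))
```

On this toy data the new "m" object at 0.15 receives p-value 1 for label 0 and p-value 1/5 for label 1, so the prediction set at significance level 0.25 is {0}. Because the comparison uses only the five "m" examples (four old ones plus the new one), no p-value can fall below 1/5; this is the price of separate validity per category: the smaller a category, the less informative its predictions can be at small significance levels.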