The binomial cumulative distribution function, or, is my system better than yours?

In human language technology, it is increasingly common to run systematic evaluations in which two or more systems, or two or more versions of the same system, are pitted against each other. We propose the binomial cumulative distribution function as a way to assess the cumulative effect of the measures collected in such evaluations. We present an application of this measure to the evaluation of the natural language (NL) interface to an Intelligent Tutoring System. We conclude by discussing a few issues pertaining to this statistical measure.
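
To make the quantity concrete, the following is a minimal sketch of how a binomial cumulative distribution function could be used to compare two systems. It is an illustration under stated assumptions, not necessarily the procedure applied in the evaluation reported here: we assume that each of n collected measures is scored as a win for one system or the other, and that under the null hypothesis each system is equally likely to win any given measure (p = 0.5). The function names `binomial_cdf` and `prob_at_least` are hypothetical and chosen for this example only.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    # Sum the probability mass of 0, 1, ..., k successes.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k), the chance of k or more wins out of n.

    Assumption for this sketch: p = 0.5 models "neither system is better".
    """
    return 1.0 - binomial_cdf(k - 1, n, p)


if __name__ == "__main__":
    # Hypothetical tally: system A is better on 8 of 10 measures.
    print(prob_at_least(8, 10))
```

A small value of `prob_at_least(k, n)` suggests that a tally this lopsided would be unlikely if the two systems were equally good. The sketch treats the measures as independent, which may not hold when several measures are derived from the same data.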