Shooting Craps in Search of an Optimal Strategy for Training Connectionist Pattern Classifiers

We compare two strategies for training connectionist (as well as non-connectionist) models for statistical pattern recognition. The probabilistic strategy is based on the notion that Bayesian discrimination (i.e., optimal classification) is achieved when the classifier learns the a posteriori class distributions of the random feature vector. The differential strategy is based on the notion that the identity of the largest a posteriori class probability of the feature vector is all that is needed to achieve Bayesian discrimination. Each strategy is directly linked to a family of objective functions that can be used in the supervised training procedure. We prove that the probabilistic strategy, linked with the error-measure objective functions (such as mean-squared-error and cross-entropy) typically used to train classifiers, necessarily requires larger training sets and more complex classifier architectures than those needed to approximate the Bayesian discriminant function. In contrast, we prove that the differential strategy, linked with classification figure-of-merit objective functions (CFM_mono) [3], requires the minimum classifier functional complexity and the fewest training examples necessary to approximate the Bayesian discriminant function with specified precision (measured in probability of error). We present our proofs in the context of a game of chance in which an unfair C-sided die is tossed repeatedly. We show that this rigged game of dice is a paradigm at the root of all statistical pattern recognition tasks, and we demonstrate how a simple extension of the concept leads to a general information-theoretic model of sample complexity for statistical pattern recognition.
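The central claim above, that Bayesian discrimination requires only the identity of the largest a posteriori class probability rather than the probabilities themselves, can be illustrated with a minimal simulation of the rigged C-sided die. The sketch below is ours, not the paper's; the posterior table, priors, and function names are arbitrary assumptions chosen for concreteness. A "classifier" whose outputs are badly distorted estimates of the posteriors, but which preserves the argmax in every case, attains essentially the same empirical error rate as the exact Bayes rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: an "unfair C-sided die" whose face (class) probabilities
# depend on a discrete feature x. Rows index feature values; columns index classes.
# These numbers are illustrative only.
C = 3
posteriors = np.array([
    [0.6, 0.3, 0.1],   # P(class | x = 0)
    [0.2, 0.5, 0.3],   # P(class | x = 1)
    [0.1, 0.2, 0.7],   # P(class | x = 2)
])
prior_x = np.array([0.5, 0.3, 0.2])   # P(x)

# A deliberately imprecise set of classifier outputs: a monotone distortion of
# each posterior row (cubing then renormalizing), which changes the values but
# preserves which entry is largest for every x.
distorted = posteriors ** 3
distorted /= distorted.sum(axis=1, keepdims=True)

def empirical_error(outputs, n_trials=100_000):
    """Toss the rigged die n_trials times; decide by argmax of the given outputs."""
    x = rng.choice(len(prior_x), size=n_trials, p=prior_x)
    true_class = np.array([rng.choice(C, p=posteriors[xi]) for xi in x])
    guesses = outputs[x].argmax(axis=1)
    return np.mean(guesses != true_class)

print(f"error with exact posteriors : {empirical_error(posteriors):.4f}")
print(f"error with distorted outputs: {empirical_error(distorted):.4f}")  # ~ same
```

Under these assumptions, both decision rules converge to the same (Bayes) error rate, which is the sense in which learning the full posterior distributions, as the error-measure objective functions demand, asks for strictly more than the discrimination task requires.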