Classification With Finite Memory Revisited

We consider the class of strong-mixing probability laws with positive transitions that are defined on doubly infinite sequences over a finite alphabet $A$. A device called the classifier (or discriminator) observes a training sequence whose probability law $Q$ is unknown. The classifier's task is to consider a second probability law $P$ and decide whether $P = Q$, or whether $P$ and $Q$ are sufficiently different according to some appropriate criterion $\Delta(Q,P) > \Delta$. If the classifier has an infinite amount of training data available, this is a simple matter. Here, however, we study the case where the amount of training data is limited to $N$ letters. We define a function $N_\Delta(Q|P)$, which quantifies the minimum sequence length needed to distinguish $Q$ from $P$, and the class $M(N_\Delta)$ of all pairs of probability laws $(Q,P)$ that satisfy $N_\Delta(Q|P) \le N_\Delta$ for some given positive number $N_\Delta$. It is shown that every pair $(Q,P)$ of probability laws that are sufficiently different according to the $\Delta$ criterion is contained in $M(N_\Delta)$. We demonstrate that for any universal classifier there exists some $Q$ for which the probability of classification error $\lambda(Q) = 1$ for some $N$-sequence emerging from $Q$ and some $P$ with $(Q,P) \in M^{\circ}(N_\Delta)$ and $\Delta(Q,P) > \Delta$, if $N < N_\Delta$. Conversely, we introduce a classification algorithm that is essentially optimal in the sense that for every $(Q,P) \in M(N_\Delta)$, the probability of classification error $\lambda(Q)$ vanishes uniformly with $N$ for every $P$ with $(Q,P) \in M^{\circ}(N_\Delta)$, provided that $N \ge N_\Delta^{1+O(\log\log N_\Delta/\log N_\Delta)}$. The proposed algorithm finds the largest empirical conditional divergence over a set of contexts that appear in the tested $N$-sequence. The computational complexity of the classification algorithm is $O(N^2(\log N)^3)$. We also introduce a second, simplified context classification algorithm with a computational complexity of only $O(N(\log N)^4)$ that is efficient in the sense that for every pair $(Q,P) \in M(N_\Delta)$, the pairwise probability of classification error $\lambda(Q,P)$ vanishes with $N$ if $N \ge N_\Delta^{1+O(\log\log N_\Delta/\log N_\Delta)}$. Conversely, $\lambda(Q,P) = 1$ for at least some $(Q,P) \in M(N_\Delta)$ if $N < N_\Delta$.
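One plausible reading of the decision statistic described above, offered only as an illustration: assuming the contexts are strings $s$ occurring in the tested $N$-sequence and "conditional divergence" means the Kullback-Leibler divergence between the empirical conditional distribution under the training data and the corresponding conditional distribution under $P$, the rule would take a form such as

$$
\hat{\Delta}_N(P) \;=\; \max_{s \in \mathcal{S}_N} D\!\left(\hat{Q}_N(\cdot \mid s)\,\middle\|\,P(\cdot \mid s)\right),
\qquad
\text{declare } P \neq Q \ \text{ iff } \ \hat{\Delta}_N(P) > \Delta,
$$

where $\mathcal{S}_N$ is the set of contexts examined, $\hat{Q}_N(\cdot \mid s)$ is the empirical distribution of the next letter given context $s$ in the training sequence, and $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence. The symbols $\hat{\Delta}_N$, $\mathcal{S}_N$, and $\hat{Q}_N$ are illustrative notation and not taken from the paper.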
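A minimal code sketch of such a context-based test, under the same assumptions (fixed-length contexts, empirical conditional KL divergence, a fixed threshold $\Delta$). The function name and the parameters `context_len`, `p_cond`, `delta`, and `alphabet` are illustrative choices; this sketch does not reproduce the paper's context-selection rule or its stated complexity guarantees.

```python
from collections import Counter, defaultdict
from math import log

def context_divergence_test(train, context_len, p_cond, delta, alphabet):
    """Hypothetical sketch of a context-based classifier.

    train       : observed N-letter training sequence (string over `alphabet`)
    context_len : length of the contexts (preceding strings) examined
    p_cond      : p_cond(context, letter) -> conditional probability under P
                  (assumed strictly positive, matching the positive-transition model)
    delta       : the distinguishability threshold Delta
    Returns True if the empirical statistic exceeds delta (declare P != Q).
    """
    # Count (context, next letter) occurrences in the training sequence.
    counts = defaultdict(Counter)
    for i in range(context_len, len(train)):
        context = train[i - context_len:i]
        counts[context][train[i]] += 1

    # Largest empirical conditional KL divergence over the observed contexts.
    stat = 0.0
    for context, ctr in counts.items():
        total = sum(ctr.values())
        d = 0.0
        for letter in alphabet:
            q_hat = ctr[letter] / total      # empirical Q(letter | context)
            p = p_cond(context, letter)      # P(letter | context)
            if q_hat > 0.0:
                d += q_hat * log(q_hat / p)
        stat = max(stat, d)

    return stat > delta

# Usage example with a memoryless P assigning probability 1/2 to each letter.
if __name__ == "__main__":
    train = "abbaabbbabababba"
    print(context_divergence_test(
        train, context_len=2, p_cond=lambda s, a: 0.5, delta=0.1, alphabet="ab"))
```

In this toy form, all contexts of one fixed length are scanned; the algorithms in the paper are understood to work over a richer set of contexts of varying lengths, which is what yields the $O(N^2(\log N)^3)$ and $O(N(\log N)^4)$ complexity figures quoted above.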