We consider the class of strong-mixing probability laws with positive transitions that are defined on doubly infinite sequences over a finite alphabet $A$. A device called the classifier (or discriminator) observes a training sequence whose probability law $Q$ is unknown. The classifier's task is to consider a second probability law $P$ and decide whether $P = Q$, or whether $P$ and $Q$ are sufficiently different according to some appropriate criterion $\Delta(Q,P) > \Delta$. If the classifier has an infinite amount of training data available, this is a simple matter. Here, however, we study the case where the amount of training data is limited to $N$ letters. We define a function $N_\Delta(Q|P)$, which quantifies the minimum sequence length needed to distinguish $Q$ from $P$, and the class $M(N_\Delta)$ of all pairs of probability laws $(Q,P)$ that satisfy $N_\Delta(Q|P) \le N_\Delta$ for some given positive number $N_\Delta$. It is shown that every pair $(Q,P)$ of probability laws that are sufficiently different according to the $\Delta$ criterion is contained in $M(N_\Delta)$. We demonstrate that for any universal classifier there exists some $Q$ for which the probability of classification error $\lambda(Q) = 1$ for some $N$-sequence emerging from $Q$ and some $P$ with $(Q,P) \in M^{\circ}(N_\Delta)$, $\Delta(Q,P) > \Delta$, if $N < N_\Delta$. Conversely, we introduce a classification algorithm that is essentially optimal in the sense that for every $(Q,P) \in M(N_\Delta)$, the probability of classification error $\lambda(Q)$ vanishes uniformly with $N$ for every $P$ with $(Q,P) \in M^{\circ}(N_\Delta)$ if $N \ge N_\Delta^{1+O(\log\log N_\Delta/\log N_\Delta)}$. The proposed algorithm finds the largest empirical conditional divergence over a set of contexts that appear in the tested $N$-sequence. The computational complexity of the classification algorithm is $O(N^2(\log N)^3)$. We also introduce a second, simplified context classification algorithm with a computational complexity of only $O(N(\log N)^4)$ that is efficient in the sense that for every pair $(Q,P) \in M(N_\Delta)$, the pairwise probability of classification error $\lambda(Q,P)$ for the pair $Q,P$ vanishes with $N$ if $N \ge N_\Delta^{1+O(\log\log N_\Delta/\log N_\Delta)}$. Conversely, $\lambda(Q,P) = 1$ at least for some $(Q,P) \in M(N_\Delta)$ if $N < N_\Delta$.
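To illustrate the kind of statistic the abstract describes, the following is a minimal sketch, not the paper's actual algorithm: it fixes a single context length, whereas the proposed algorithm works with a set of variable, data-dependent contexts and the stated complexity bounds. The names `p_cond`, `context_len`, and `threshold` are illustrative assumptions; the sketch simply computes, for each context appearing in the test sequence, the empirical conditional divergence between the empirical next-letter distribution and the known law $P$, and takes the maximum.

```python
from collections import Counter, defaultdict
import math

def max_empirical_conditional_divergence(x, p_cond, context_len):
    """Largest empirical conditional divergence over all contexts of length
    `context_len` appearing in the test sequence x (a string over a finite
    alphabet). `p_cond(letter, ctx)` returns the conditional probability of
    `letter` given context `ctx` under the known law P (assumed positive)."""
    # Empirical next-letter counts for each context that occurs in x.
    counts = defaultdict(Counter)
    for i in range(context_len, len(x)):
        counts[x[i - context_len:i]][x[i]] += 1

    best = 0.0
    for ctx, next_letters in counts.items():
        total = sum(next_letters.values())
        div = 0.0
        for letter, c in next_letters.items():
            q_hat = c / total  # empirical conditional probability from the data
            div += q_hat * math.log(q_hat / p_cond(letter, ctx))
        best = max(best, div)
    return best

def classify(x, p_cond, context_len, threshold):
    """Declare that P differs from Q when the divergence statistic is large."""
    stat = max_empirical_conditional_divergence(x, p_cond, context_len)
    return "P != Q" if stat > threshold else "P = Q"
```

Under this reading, the classifier declares $P \neq Q$ whenever the maximal empirical conditional divergence exceeds a threshold tied to the criterion $\Delta$; the paper's contribution is showing that such a test succeeds once $N \ge N_\Delta^{1+O(\log\log N_\Delta/\log N_\Delta)}$, and that no universal classifier can succeed with fewer than $N_\Delta$ training letters.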