On the evaluation of independent binary features (Corresp.)

An automaton that is close to optimal and eliminates the need for artificial randomization was also provided. This automaton is close to optimal in the sense that it requires at most two extra bits of memory, independent of $m$, to match the performance of the optimal randomized $m$-state automaton for all $p_A$ and $p_B$.

Both of the problems studied here, however, involve only two coins. How to extend the results of this paper to situations in which more than two coins are involved is an open question. Some ad hoc expedient automata are available in the literature [6], [7]. Before an optimal solution to the many-armed bandit problem is possible, the problem of multiple hypothesis testing with finite memory needs to be solved; for some recent results concerning this problem, see [13]. Further, finite-time finite-memory solutions to these problems are of interest. Vasilev [14] and Witten [15] studied the finite-time behavior of some solutions to the TABP. No optimal solution, however, is available. Some recent progress has been reported by Cover et al. [16].

ACKNOWLEDGMENT

The authors thank the referees for comments which helped to improve the paper.

APPENDIX

Denote by $p(a; p_A, p_B)$ the asymptotic proportion of heads achieved, given the coins A and B and the automaton $a$. Even though $p(a; p_A, p_B)$ is maximized over all $m$-state automata if and only if $r(a; p_A, p_B)$ is maximized, maximizing $\inf p(a; p_A, p_B)$ is not necessarily equivalent to maximizing $\inf r(a; p_A, p_B)$, where the infimum is taken over $\{(p_A, p_B)\}$. In fact, for the version of the TABP in which $p_B$ is known precisely, an automaton that maximizes $\inf p(a; p_A, p_B)$ tosses coin B exclusively, guaranteeing a proportion of heads of $p_B$ regardless of $p_A$. Furthermore, this automaton is not even expedient: its proportion $p_B$ falls below the chance level $(p_A + p_B)/2$ whenever $p_A > p_B$. In this sense the solution is unsatisfactory.

REFERENCES

[1] H. Robbins, "Some aspects of the sequential design of experiments," Bull. Am. Math. Soc., vol. 58, pp. 527-535, 1952.
[2] H. Robbins, "A sequential decision problem with a finite memory," Proc. Nat'l. Acad. Sci., vol. 42, pp. 920-923, 1956.
[3] I. H. Witten, "The apparent conflict between estimation and control--A survey of the two-armed bandit problem," J. Franklin Institute, vol. 301, no. 1-2, pp. 161-190, Jan.-Feb. 1976.
[4] T. Cover and M. E. Hellman, "The two-armed bandit problem with time-invariant finite memory," IEEE Trans. Inform. Theory, vol. IT-16, no. 2, pp. 185-195, Mar. 1970.
[5] M. E. Hellman and T. Cover, "Learning with finite memory," Ann. Math. Stat., vol. 41, pp. 765-782, June 1970.
[6] M. L. Tsetlin, Automaton Theory and Modeling of Biological Systems. New York: Academic, 1973.
[7] K. S. Fu and T. J. Li, "Formulation of learning automata and automata games," Information Sciences, vol. 1, no. 3, pp. 237-256, July 1969.
[8] H. Chernoff, "Approaches in sequential design of experiments," in Statistical Design and Linear Models, J. N. Srivastava, Ed. New York: American Elsevier, 1975, pp. 67-90.
[9] S. J. Yakowitz, Mathematics of Adaptive Control Processes. New York: American Elsevier, 1969.
[10] M. H. DeGroot, Optimal Statistical Decisions. New York: McGraw-Hill, 1970, ch. 14.
[11] K. B. Lakshmanan and B. Chandrasekaran, "Compound hypothesis testing with finite memory," submitted for publication.
[12] A. A. Milyutin, "On automata with optimal expedient behavior in stationary media," Automation and Remote Control, vol. 26, pp. 116-131, 1965.
[13] B. Chandrasekaran and K. B. Lakshmanan, "Multiple hypothesis testing with finite memory," Cybernetics and Information Science, 1977, to appear.
[14] N. B. Vasilev and I. I. Pyatetskii-Shapiro, "The time for an automaton to adapt to the external medium," Automation and Remote Control, pp. 1100-1103, 1967.
[15] I. H. Witten, "Finite time performance of some two-armed bandit controllers," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 3, pp. 194-197, Mar. 1973.
[16] T. Cover, M. A. Freedman, and M. E. Hellman, "Optimal finite memory learning algorithms for the finite sample problem," Information and Control, vol. 30, pp. 49-85, Jan. 1976.

On the Evaluation of Independent Binary Features

ROBERT P. W. DUIN, CHRIS E. VAN HAERSMA BUMA, AND LUITZEN ROOSMA

Manuscript received July 1, 1975; revised April 26, 1977. R. P. W. Duin and L. Roosma are with the Department of Applied Physics, Delft University of Technology, Delft, The Netherlands. C. E. van Haersma Buma is with the Philips Audio Division, Eindhoven, The Netherlands.

For the case of independent binary features, conditions under which the addition of a new feature does not decrease the Bayes error are derived. These conditions lead to illustrations of families of distributions for which the best two independent measurements are not the two best.

I. INTRODUCTION

We consider the problem of classifying the $K$-dimensional binary vector $x = (x^1, x^2, \ldots, x^K)$ into one of the two classes $A$ and $B$. The features are assumed to be statistically independent for both classes, so that the probability distribution of $x$, given class $i$, can be written

$$F_i(x) = \prod_{j=1}^{K} \{ p_i^j x^j + (1 - p_i^j)(1 - x^j) \}, \qquad (1)$$

where $i = A, B$ and $p_i^j = \text{Prob}(x^j = 1 \mid x \in \text{class } i)$. The Bayes error $\epsilon$ made by using (1) for classification is

$$\epsilon = \sum_x \min \{ c F_A(x), (1 - c) F_B(x) \}, \qquad (2)$$

in which $c$ is the a priori probability of class $A$.

The main purpose of this note is to investigate the effect upon $\epsilon$ of the addition of a $(K+1)$st feature. Conditions under which $\epsilon$ does not decrease will be given, that is, cases in which the addition of a new feature does not result in an improvement of the probability of correct classification.

The Bayes error (2) can be expressed in terms of the contributions $\epsilon_x$ of all points $x$ by

$$\epsilon = \sum_x \epsilon_x F(x), \qquad (3)$$

where $F(x) = c F_A(x) + (1 - c) F_B(x)$ is the probability of $x$. We write the error $\epsilon'$ obtained when the dimensionality is raised from $K$ to $K + 1$ as a sum over all points $x$ of the $K$-dimensional space, say

$$\epsilon' = \sum_x \epsilon'_x F(x), \qquad (4)$$

where $\epsilon'_x$ can be interpreted as the probability of error at a given $K$-dimensional point $x$ when the additional $(K+1)$st feature is used. The probabilities $\epsilon$ and $\epsilon'$ will be compared by comparing $\epsilon_x$ and $\epsilon'_x$ for all points $x$. Let

$$a(x) = \frac{(1 - c) F_B(x)}{c F_A(x)} \qquad (5)$$

be the probability ratio of class $B$ to class $A$ at a given point $x$. It can be shown (see [5]) that $\epsilon'_x = \epsilon_x$ when $p_A^{K+1}/p_B^{K+1}$ and $(1 - p_A^{K+1})/(1 - p_B^{K+1})$ are either simultaneously larger than $a(x)$ or simultaneously smaller than $a(x)$, since the Bayes decision is then the same at both extended points $(x, 0)$ and $(x, 1)$ as at $x$ itself. If this holds for all $x$, then $\epsilon' = \epsilon$, and the addition of the new feature gives no improvement.

When features with probabilities $p_A^{K+1}$ and $p_B^{K+1}$ are plotted in a $(p_A, p_B)$ plane, there is an area (shaded in Fig. 1) where these conditions apply simultaneously; for the proof, see [4]. A feature in the shaded area of Fig. 1 therefore gives no improvement when it is added to the feature set. The important thing to note is that such a feature does not necessarily satisfy $p_A^{K+1} = p_B^{K+1}$.
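The quantities in (1)-(5) are easy to evaluate numerically. The sketch below is our illustration, not the authors' implementation; the function names and the example probabilities are assumptions. It computes the Bayes error (2) by enumerating all $2^K$ binary vectors and checks the no-improvement condition on a candidate $(K+1)$st feature, assuming every probability lies strictly between 0 and 1 so that $a(x)$ is finite and nonzero.

```python
from itertools import product

def f_indep(x, p):
    # Class-conditional probability of binary vector x under the
    # independence model (1): product over j of p[j] if x[j] == 1, else 1 - p[j].
    prob = 1.0
    for xj, pj in zip(x, p):
        prob *= pj if xj == 1 else 1.0 - pj
    return prob

def bayes_error(pA, pB, c):
    # Bayes error (2): sum over all 2^K binary vectors x of
    # min{c F_A(x), (1 - c) F_B(x)}.
    return sum(min(c * f_indep(x, pA), (1.0 - c) * f_indep(x, pB))
               for x in product((0, 1), repeat=len(pA)))

def no_improvement(pA, pB, c, pA_new, pB_new):
    # Sufficient condition of the note: the added feature leaves the Bayes
    # error unchanged if, for every x, both pA_new/pB_new and
    # (1 - pA_new)/(1 - pB_new) lie on the same side of the ratio a(x) of (5).
    r1 = pA_new / pB_new
    r0 = (1.0 - pA_new) / (1.0 - pB_new)
    for x in product((0, 1), repeat=len(pA)):
        a = (1.0 - c) * f_indep(x, pB) / (c * f_indep(x, pA))
        if not (min(r0, r1) > a or max(r0, r1) < a):
            return False
    return True

# Illustrative numbers (ours): two features plus a candidate third feature
# with pA_new = 0.55, pB_new = 0.5.
pA, pB, c = [0.8, 0.6], [0.3, 0.4], 0.5
print(bayes_error(pA, pB, c))                   # error with K = 2 features
print(bayes_error(pA + [0.55], pB + [0.5], c))  # error with the third added
print(no_improvement(pA, pB, c, 0.55, 0.5))     # condition of the note
```

For these particular numbers the condition holds and both printed errors agree at 0.25, even though $p_A^3 = 0.55 \neq p_B^3 = 0.5$: a feature that differs across the classes yet buys nothing, matching the closing remark above.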
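The abstract's claim that the best two independent measurements need not be the two best can likewise be probed numerically. The brute-force search below is our construction and reuses bayes_error from the previous sketch; whether it finds a counterexample within the given number of trials depends on the random draw, so a None result is possible.

```python
import random

def best_two_vs_two_best(n_feat=4, c=0.5, trials=2000, seed=1):
    # Draw per-feature probabilities at random and look for a draw in which
    # the Bayes-optimal PAIR of features differs from the pair formed by the
    # two individually best single features.
    rng = random.Random(seed)
    for _ in range(trials):
        pA = [rng.uniform(0.05, 0.95) for _ in range(n_feat)]
        pB = [rng.uniform(0.05, 0.95) for _ in range(n_feat)]
        singles = sorted(range(n_feat),
                         key=lambda j: bayes_error([pA[j]], [pB[j]], c))
        pairs = [(i, j) for i in range(n_feat) for j in range(i + 1, n_feat)]
        best_pair = min(pairs,
                        key=lambda ij: bayes_error([pA[k] for k in ij],
                                                   [pB[k] for k in ij], c))
        if set(best_pair) != set(singles[:2]):
            return pA, pB, tuple(singles[:2]), best_pair
    return None

print(best_two_vs_two_best())
```

Any parameters this search returns constitute a family of the kind the abstract describes: independent binary features whose Bayes-optimal pair is not the pair of the two singly best features.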