5.2. Improvement through Adding New Learning Methods

Guided by appropriate performance measures, modification and testing of the system proceeds systematically. Interesting ideas arise directly as a result of taking the multi-strategy view. The goal is to exploit the strengths of individual methods while eliminating their weaknesses. Some examples:

1. The genetic mutation operator described in Chapter 0.3.3.

2. Higher level concepts via hidden units. Once a good set of patterns has been obtained, it may be possible for the system to develop a more sophisticated evaluation function. This function, patterned after neural nets, would have hidden units that correspond to higher level interactions between the patterns. For example, conjunctions and disjunctions may be realized and given weights different from those implied by their components.

3. Clarity of system's knowledge. The "meaning" of hidden units to which weights are associated in neural nets is usually not clear, whereas in experience-based systems it is specific structures that are given weights. Indeed, it is the transparency of Morph's knowledge that has allowed its learning mechanisms to be fine tuned; with various system utilities it is possible to ascertain exactly why Morph is selecting one move over another.

6. Conclusions and Ongoing Directions

The development of a computational framework for experience-based learning is a difficult but important challenge. Here, we have argued for the necessity of a multi-strategy approach: at the least, an adaptive search system requires mechanisms for credit assignment, feature creation and deletion, weight maintenance and state evaluation. Further, it has been demonstrated that the TD error measure can provide a mechanism by which the system can monitor its own error rate and steer itself to smooth convergence. The error rate provides a more refined metric that is nonetheless well correlated with the reinforcement values and with more domain-specific metrics. Finally, in a system with many components (pws) to be adjusted, learning rates should be allowed to differ across these components. Simulated annealing provides this capability. APS has produced encouraging results in a variety of domains studied as classroom projects (Levinson et al. application of the Morph-APS shell 1 …

ing temporal differences to create useful concepts for evaluating states.

Acknowledgments

Thank you to Jeff Keller for constructing the new evaluation function, to Paul Zola and Kamal Mostafa for the initial Morph implementation, and to Richard Sutton for sharing our enthusiasm for reinforcement learning. Finally, we would like to thank Richard Snyder for valuable editing assistance.
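
The conclusion that learning rates should differ across components, with the TD error serving as a running error measure, can be illustrated with a minimal sketch. This is not Morph's actual implementation: the class name PatternWeight, the 1/(n+1) cooling schedule, the initial weight of 0.5, and the helper mean_td_error are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PatternWeight:
    """A pattern-weight pair (pw); all fields are illustrative assumptions."""
    weight: float = 0.5  # current estimate of the pattern's value
    updates: int = 0     # number of times this weight has been adjusted

    def learning_rate(self) -> float:
        # Assumed cooling schedule: the rate shrinks as this particular pw
        # accumulates updates, so rarely seen patterns stay more plastic.
        return 1.0 / (self.updates + 1)

    def td_update(self, target: float) -> float:
        """Move the weight toward `target`; return the absolute TD error."""
        error = target - self.weight
        self.weight += self.learning_rate() * error
        self.updates += 1
        return abs(error)


def mean_td_error(errors):
    """Average absolute TD error, usable as a simple convergence monitor."""
    return sum(errors) / len(errors) if errors else 0.0


# Usage: a frequently updated pw cools down, a fresh pw keeps a high rate.
seasoned = PatternWeight()
errors = [seasoned.td_update(t) for t in (1.0, 0.0, 1.0)]
fresh = PatternWeight()
print(seasoned.learning_rate(), fresh.learning_rate(), mean_td_error(errors))
```

With this particular schedule each weight ends up as the running average of the targets it has seen. Any other decreasing schedule could be substituted without changing the structure of the sketch.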