We study multiagent learning in a simulated soccer scenario. Players from
the same team share a common policy for mapping inputs to actions, and they
are rewarded or punished collectively whenever a goal is scored. For varying
team sizes, we compare the following learning algorithms: TD-Q learning with
linear neural networks (TD-Q-LIN), TD-Q learning with a neural gas network (TD-Q-NG),
Probabilistic Incremental Program Evolution (PIPE), and a PIPE variant
based on coevolution (CO-PIPE). TD-Q-LIN and TD-Q-NG try to learn
evaluation functions (EFs) mapping input/action pairs to expected reward.
PIPE and CO-PIPE search policy space directly. They use adaptive
probability distributions to synthesize programs that calculate action
probabilities from current inputs. We find that learning appropriate EFs
is hard for both EF-based approaches. Direct search in policy space
discovers more reliable policies and is faster.
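
The EF-based approaches can be pictured with a minimal sketch: a linear model scores input/action pairs and is nudged by a TD(0)-style update toward the collectively received reward, in the spirit of TD-Q-LIN. All names and parameter choices below (LinearEF, alpha, gamma, the feature vectors) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class LinearEF:
    """Sketch of an evaluation function (EF): expected reward per input/action pair."""

    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.95):
        self.w = np.zeros((n_actions, n_features))  # one weight vector per action
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)

    def value(self, x, a):
        # EF estimate for taking action a on input features x
        return float(self.w[a] @ x)

    def update(self, x, a, reward, x_next):
        # TD-Q-style target: shared team reward plus discounted value of the
        # best action in the next state
        n_actions = self.w.shape[0]
        target = reward + self.gamma * max(self.value(x_next, b) for b in range(n_actions))
        td_error = target - self.value(x, a)
        self.w[a] += self.alpha * td_error * x  # gradient step on the linear EF
```

In contrast, PIPE and CO-PIPE skip the EF entirely and adapt a probability distribution over programs whose outputs are turned directly into action probabilities.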