Deep Decentralized Multi-task Multi-Agent RL under Partial Observability

Figure 6. Visualization of the MAMT domain. Agents and targets operate on a toroidal m × m gridworld. Each agent (circle) is assigned a unique target (cross) to capture, but does not observe its assigned target ID. Targets’ states are fully occluded at each timestep with probability P_f. Despite the simplicity of the gridworld transitions, reward sparsity makes this an especially challenging task. During both learning and execution, the team receives no reward unless all targets are captured simultaneously by their corresponding agents.
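
The domain mechanics described in the caption can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`step_toroidal`, `team_reward`, `observe_target`), the reward magnitude of 1.0, the reading of "capture" as an agent occupying the same cell as its target, and occlusion sampled independently per observation are all assumptions made for illustration.

```python
import random


def step_toroidal(pos, move, m):
    """Apply a move on an m x m torus: coordinates wrap around the edges."""
    return ((pos[0] + move[0]) % m, (pos[1] + move[1]) % m)


def team_reward(agent_positions, target_positions):
    """Sparse joint reward: nonzero only if every agent i sits on target i.

    Assumes 'capture' means sharing a cell, and a reward of 1.0 (hypothetical).
    """
    captured = all(a == t for a, t in zip(agent_positions, target_positions))
    return 1.0 if captured else 0.0


def observe_target(target_pos, p_f, rng):
    """Return the target position, or None when it is occluded.

    Occlusion happens with probability p_f, sampled per call (an assumption).
    """
    return None if rng.random() < p_f else target_pos
```

For example, on a 5 × 5 grid, `step_toroidal((0, 0), (-1, 0), 5)` wraps to `(4, 0)`. `team_reward` stays at zero when agents stand on each other's targets, which reflects the sparsity the caption highlights: partial or mismatched captures yield no learning signal.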