Can Small Heads Help? Understanding and Improving Multi-Task Generalization