2-D latent space models: Layer-wise perceptual training and spatial grounding