PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining