Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation