Hierarchical Self-supervised Representation Learning for Movie Understanding