Video Joint Modelling Based on Hierarchical Transformer for Co-summarization