Self-supervised Video-centralised Transformer for Video Face Clustering