Naming multi-modal clusters to identify persons in TV broadcast