Dense video captioning based on local attention