Czech News Dataset for Semantic Textual Similarity

This paper describes a novel dataset of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each final annotation as the average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides the agreement figures, we provide detailed statistics of the collected dataset. We conclude the paper with a baseline experiment in which we build a system for predicting the semantic similarity of sentences. Owing to the large number of training annotations (116,956), the model significantly outperforms an average annotator (Pearson’s correlation coefficient of 0.92 versus 0.86).
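As an illustration of the two computations mentioned above (averaging several individual scores into one final annotation, and comparing predictions against those averages with Pearson's correlation coefficient), the following is a minimal sketch rather than the pipeline used for the dataset. The function names `gold_scores` and `pearson`, the toy scores, and the score range shown are assumptions made for the example only; none of the numbers come from the dataset.

```python
from math import sqrt
from statistics import mean


def gold_scores(annotations):
    """Average the individual scores of each sentence pair into one final score."""
    return [mean(scores) for scores in annotations]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): three sentence pairs, each scored by 9 annotators.
toy_annotations = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [3, 4, 3, 3, 4, 3, 2, 3, 3],
    [6, 5, 6, 6, 6, 5, 6, 6, 6],
]
# Hypothetical model predictions for the same three pairs.
toy_predictions = [0.5, 3.0, 5.5]

gold = gold_scores(toy_annotations)
r = pearson(toy_predictions, gold)
print(f"Pearson's r on toy data: {r:.3f}")
```

The same `pearson` function could be applied to a single annotator's scores against the averaged scores, which is one plausible way to obtain an annotator-level figure comparable to a model's.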