Improve Quora Question Pair Dataset for Question Similarity Task

Automatic detection of semantically equivalent questions is a task of the utmost importance in question answering systems. The Quora dataset, released for the Quora Question Pairs competition hosted on Kaggle, has been used by many researchers to train systems that identify duplicate questions. However, the ground-truth labels in this dataset are not fully accurate and may include mislabeled pairs. In this paper, we concentrate on improving the quality of the Quora dataset by combining several strategies: BERT-based methods, rules, and manual relabeling by human annotators.