Why are sentence similarity benchmarks not predictive of application-oriented task performance?