Towards Verifiable Benchmarks for Reinforcement Learning