Dialog Policy Optimization for Low-Resource Settings Using Self-Play and Reward-Based Sampling