Efficient (Soft) Q-Learning for Text Generation with Limited Good Data