Unified NDP method based on TD(0) learning for both average and discounted Markov decision processes