CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery