MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth