To overcome the lack of a multispectral (MS) image while adequately preserving the spatial information of the panchromatic (PAN) image and the spectral information of the MS image, this study proposes a method that injects the spectral information of the prior MS image into the prior PAN image during training, so that only the posterior PAN image is needed for prediction. First, we introduce an autoencoder model based on image colorization and discuss its feasibility for multi-band remote sensing image pan-sharpening. Then, image quality evaluation functions covering both spatial and spectral indexes are combined into the loss function that guides the colorization model. Because the loss function contains spatial and spectral evaluation indexes, it can directly measure the loss between the network output and the label in a way that reflects the characteristics of remote sensing images. Moreover, the training data of our model consist of original PAN images only, so it is not necessary to prepare simulated degraded MS and PAN data for training, which is a major difference from most existing deep learning pan-sharpening methods. Three aspects change the current learning framework and optimization rule of deep learning pan-sharpening: a loss function built from spectral and spatial quality indexes instead of the generic MSE (mean square error); only the original PAN, instead of simulated degraded MS + PAN pairs, as input; and learning the spectral features instead of the fused result directly. Finally, thousands of remote sensing images from different scenes are used to build the training dataset and verify the effectiveness of the proposed method. In addition, we selected seven representative pan-sharpening algorithms and four widely recognized objective fusion metrics to evaluate and compare performance on the WorldView-2 experimental data. The results show that the proposed method achieves the best performance in terms of both subjective visual quality and objective assessment.
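The abstract does not give the exact form of the combined spectral-spatial loss. As a minimal sketch of the idea only, assuming a SAM-style spectral term against an MS reference and a gradient-difference spatial term against the PAN image (the function name `spectral_spatial_loss`, the weight `lambda_spatial`, and the specific terms are all hypothetical, not the paper's actual formulation), such a loss could look like:

```python
import torch
import torch.nn.functional as F

def spectral_spatial_loss(pred, ref_ms, ref_pan, lambda_spatial=1.0, eps=1e-8):
    """Hypothetical combined loss: a spectral-angle term against the MS
    reference plus a gradient-difference spatial term against the PAN image.

    pred:    (B, C, H, W) network output (pan-sharpened bands)
    ref_ms:  (B, C, H, W) spectral reference (e.g., upsampled MS during training)
    ref_pan: (B, 1, H, W) panchromatic image providing spatial detail
    """
    # Spectral term: mean spectral angle between predicted and reference pixel vectors.
    dot = (pred * ref_ms).sum(dim=1)
    norm = pred.norm(dim=1) * ref_ms.norm(dim=1) + eps
    sam = torch.acos(torch.clamp(dot / norm, -1 + eps, 1 - eps)).mean()

    # Spatial term: match horizontal/vertical gradients of the band-averaged
    # prediction to those of the PAN image.
    pred_gray = pred.mean(dim=1, keepdim=True)
    dx = lambda x: x[..., :, 1:] - x[..., :, :-1]
    dy = lambda x: x[..., 1:, :] - x[..., :-1, :]
    spatial = F.l1_loss(dx(pred_gray), dx(ref_pan)) + F.l1_loss(dy(pred_gray), dy(ref_pan))

    return sam + lambda_spatial * spatial
```

Such a loss can be dropped into a standard training loop in place of MSE, which reflects the abstract's claim that the network is optimized directly against remote-sensing-oriented quality indexes rather than a pixel-wise error.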