Most current methods for classifying urban functional areas analyze and model data from a single source, and therefore cannot make full use of the multi-scale, multi-source data that are now easy to obtain. This paper proposes MM-UrbanFAC, a classification model for urban functional areas based on multi-modal machine learning. By analyzing regional remote sensing images together with the behavior data of visitors to each area, the model uses a combination of supervised methods to extract the deep features of, and relationships among, the different kinds of data, and to filter and merge their global and local features. A dual-branch neural network combining SE-ResNeXt and Dual Path Network (DPN) automatically mines and fuses the global features of the multi-source data, while purpose-built feature engineering mines the user behavior data in depth to obtain more association information. An algorithm based on the Gradient Boosting Decision Tree (GBDT) then learns the features at each level and outputs a classification probability for each level of features. Finally, a GBDT-based algorithm is applied again to learn the probability distributions produced for the different feature levels and to obtain the final prediction of the urban functional area class. Analysis and experimental verification on real data sets show that MM-UrbanFAC effectively integrates the features of multi-modal data. Compared with a single classifier, the ensemble framework based on gradient boosting trees improves prediction performance: it effectively integrates the results of multiple models and accurately classifies urban functional areas. The model can serve as a reference for tourism recommendation, urban land planning, and urban construction.
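The two-stage use of gradient boosting described above (first-level models that output class probabilities for each level of features, followed by a further model trained on those probabilities) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes scikit-learn's `GradientBoostingClassifier` as the GBDT learner, uses synthetic data in place of the image-branch and behavior features, and the names `X_img`, `X_beh`, and `level1_probs` are hypothetical. Out-of-fold probabilities are used for the second stage, which is a common stacking practice and also an assumption here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in data with 3 hypothetical functional-area classes.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=8, n_classes=3, random_state=0
)
# Hypothetical split of the columns into two feature levels ("views"):
# one standing in for image-derived features, one for behavior features.
X_img, X_beh = X[:, :6], X[:, 6:]

# Hold out 90 samples for testing; the remaining 210 are used for training.
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=90, stratify=y, random_state=0
)


def level1_probs(X_view):
    """Return out-of-fold train probabilities and test probabilities for one view."""
    model = GradientBoostingClassifier(n_estimators=30, random_state=0)
    # Out-of-fold probabilities avoid leaking training labels into stage two.
    oof = cross_val_predict(
        model, X_view[idx_train], y[idx_train], cv=3, method="predict_proba"
    )
    # Refit on the full training split to score the held-out test samples.
    model.fit(X_view[idx_train], y[idx_train])
    return oof, model.predict_proba(X_view[idx_test])


oof_img, test_img = level1_probs(X_img)
oof_beh, test_beh = level1_probs(X_beh)

# Stage two: concatenate the per-view class probabilities into meta-features.
meta_train = np.hstack([oof_img, oof_beh])
meta_test = np.hstack([test_img, test_beh])

meta_model = GradientBoostingClassifier(n_estimators=30, random_state=0)
meta_model.fit(meta_train, y[idx_train])

final_probs = meta_model.predict_proba(meta_test)  # one row per test sample
final_pred = meta_model.predict(meta_test)         # predicted class labels
```

In this sketch each first-level model contributes one probability per class, so the second-level model sees a six-dimensional input (two views times three classes) rather than the raw features.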