Spark-Based Machine Learning Pipeline Construction Method

As the volume of data captured in industry keeps growing and the accuracy requirements for machine learning algorithms keep rising, stand-alone computing falls far short of the computing speed and storage capacity that machine learning needs. At the same time, distributed machine learning suffers from a high learning cost and from insufficiently intelligent model building. This paper proposes a method for constructing a pipeline that covers data processing, feature engineering, model training, evaluation, and prediction, and that uses concurrency to speed up data processing. We use the method to implement a Spark-based machine learning visualization system, which lowers the barrier to entry for distributed machine learning and verifies the feasibility of the method.