Paper Information - Spark-Based Machine Learning Pipeline Construction Method

Spark-Based Machine Learning Pipeline Construction Method

As the volume of data captured in industry grows and the accuracy requirements for machine learning algorithms rise, stand-alone computing falls far short of the computing speed and storage capacity that machine learning requires. At the same time, distributed machine learning suffers from a high learning cost and insufficiently intelligent model building. This paper proposes a method for constructing a pipeline that covers data processing, feature engineering, model training, evaluation, and prediction, and that uses concurrency to speed up data processing. We use the method to implement a Spark-based machine learning visualization system, which lowers the barrier to entry for distributed machine learning and verifies the feasibility of the method.
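The stage order named in the abstract (data processing, feature engineering, model training, evaluation, prediction) can be illustrated with a minimal sketch. This is not the authors' method and does not use Spark: it is plain Python, with a thread pool standing in for concurrent data processing. Every function name, the row layout (`x`, `y`), and the toy linear model are assumptions made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# All names below are hypothetical; none come from the paper.

def clean(rows):
    """Data processing: drop rows with a missing feature or label."""
    return [r for r in rows if r["x"] is not None and r["y"] is not None]

def process_partitions(partitions):
    """Clean each partition concurrently, then merge the results in order."""
    with ThreadPoolExecutor() as pool:
        cleaned = list(pool.map(clean, partitions))
    return [row for part in cleaned for row in part]

def add_features(rows):
    """Feature engineering: center x on its mean; return rows and the mean."""
    mu = mean(r["x"] for r in rows)
    return [{**r, "xc": r["x"] - mu} for r in rows], mu

def train(rows):
    """Model training: least-squares line fitted on the centered feature."""
    y_bar = mean(r["y"] for r in rows)
    sxx = sum(r["xc"] ** 2 for r in rows)
    slope = sum(r["xc"] * (r["y"] - y_bar) for r in rows) / sxx
    return {"slope": slope, "intercept": y_bar}

def predict(model, mu, x):
    """Prediction: apply the fitted line to a raw x value."""
    return model["intercept"] + model["slope"] * (x - mu)

def evaluate(model, mu, rows):
    """Evaluation: mean squared error of the model on the given rows."""
    return mean((predict(model, mu, r["x"]) - r["y"]) ** 2 for r in rows)

def run_pipeline(partitions):
    """Chain the stages in the order the abstract lists them."""
    rows = process_partitions(partitions)
    rows, mu = add_features(rows)
    model = train(rows)
    return model, mu, evaluate(model, mu, rows)
```

Calling `run_pipeline` with a list of partitions (each a list of `{"x": ..., "y": ...}` dictionaries) returns the fitted model, the feature mean, and the training error; `predict` can then be applied to new values. In a real Spark deployment the partitioning and concurrency would be handled by the cluster rather than by a local thread pool.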
