Paper Information - MLlib: Machine Learning in Apache Spark

MLlib: Machine Learning in Apache Spark

Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
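To make the idea of an end-to-end pipeline concrete, the following is a minimal conceptual sketch in plain Python of the pattern such an API is built around: transformer stages that reshape data, an estimator that is fit to produce a model, and a pipeline that chains them. It is not MLlib's API and does not run on Spark. Every class name here (`Tokenizer`, `TokenVoteClassifier`, `TokenVoteModel`, `Pipeline`, `PipelineModel`) and the toy data are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


class Tokenizer:
    """Transformer stage (hypothetical): lowercases and splits each text."""

    def transform(self, texts):
        return [t.lower().split() for t in texts]


class TokenVoteClassifier:
    """Estimator stage (hypothetical): counts token/label co-occurrences."""

    def fit(self, token_lists, labels):
        counts = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            for tok in tokens:
                counts[tok][label] += 1
        return TokenVoteModel(counts)


class TokenVoteModel:
    """Fitted model: predicts the label that collects the most token votes."""

    def __init__(self, counts):
        self.counts = counts

    def transform(self, token_lists):
        preds = []
        for tokens in token_lists:
            votes = Counter()
            for tok in tokens:
                votes.update(self.counts.get(tok, {}))
            # None signals that no token was seen during training.
            preds.append(votes.most_common(1)[0][0] if votes else None)
        return preds


class PipelineModel:
    """Result of fitting a Pipeline: applies every stage, then the model."""

    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def transform(self, texts):
        data = texts
        for stage in self.transformers:
            data = stage.transform(data)
        return self.model.transform(data)


class Pipeline:
    """Chains transformer stages and ends with a single estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, texts, labels):
        data = texts
        for stage in self.transformers:
            data = stage.transform(data)
        return PipelineModel(self.transformers, self.estimator.fit(data, labels))


# Toy training data, invented for this sketch.
train_texts = ["spark runs fast", "spark scales well", "slow batch job", "slow disk job"]
train_labels = [1, 1, 0, 0]

model = Pipeline([Tokenizer()], TokenVoteClassifier()).fit(train_texts, train_labels)
predictions = model.transform(["spark scales", "slow disk"])
```

The point of the pattern is that the same chain of stages is applied at training time and at prediction time, so feature preparation cannot drift between the two.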
