Paper information - Analyzing mobile phone usage using clustering in Spark MLLib and Pig

Analyzing mobile phone usage using clustering in Spark MLLib and Pig

K-means is a common method for clustering data points into a predefined number of clusters. Apache Spark is a distributed computing engine for fast, large-scale data processing. Using its machine learning library, MLLib, we analyze mobile data obtained from Opencellid.org by clustering the records on their latitude and longitude values with the K-means algorithm. Once each data point has been assigned a cluster number, the dataset is loaded into Apache Pig to calculate the number of users in each cluster. Thus, we can analyze the number of users of a mobile network within a particular range of latitude and longitude.

Keywords: Spark, Pig, clustering, mobile, data, analysis
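
The two steps described in the abstract (cluster records by latitude and longitude, then count the records in each cluster) can be sketched in a few lines. This is a minimal plain-Python stand-in, not the paper's Spark MLLib or Pig code: the function names `kmeans` and `count_per_cluster`, the sample coordinates, and the use of squared Euclidean distance on raw degrees are all assumptions made for illustration.

```python
import random
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on (lat, lon) pairs; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct records as starting centroids
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]  # assignment step
        updated = []
        for c in range(k):  # update step: move each centroid to its cluster mean
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # no centroid moved, so the result is stable
            break
        centroids = updated
    # Final assignment so the labels match the returned centroids.
    return centroids, [nearest(p, centroids) for p in points]


def count_per_cluster(labels):
    """Stand-in for the Pig step: group by cluster number and count records."""
    return Counter(labels)


# Hypothetical (lat, lon) records forming two well-separated groups.
points = [
    (28.60, 77.20), (28.70, 77.10), (28.50, 77.30),
    (19.00, 72.80), (19.10, 72.90),
]
centroids, labels = kmeans(points, k=2)
counts = count_per_cluster(labels)
```

With this toy input the two groups separate cleanly, so `counts` holds one cluster of three records and one of two. In a real pipeline the clustering would run in Spark and the per-cluster count would be a group-and-count query in Pig.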

Shefali Arora
