Performance Evaluation of Mahout Clustering Algorithms Using a Twitter Streaming Dataset

Big Data has become commonplace in most Internet-based applications, which generate very large data sets by delivering services to planetary-scale numbers of users. Such data sets are considered a valuable source of analytics information and knowledge for many purposes and domains. It is increasingly claimed that Big Data and machine learning, especially data mining, are the basis for developing advanced analytics platforms that turn data into valuable assets, gain competitive advantage and support better decisions. At the same time, however, Big Data applications are proving to be killer applications for state-of-the-art machine learning and data mining algorithms. Indeed, traditional data mining frameworks such as WEKA and R, and commercial ones such as IBM SPSS Modeler, SAS Enterprise Miner and Oracle Data Mining, face the challenges of 1) mining large data sets within short times and 2) doing so under high rates of data generation. The envisaged way to deal effectively with such challenges is to move to Cloud-based versions of these frameworks and to develop new frameworks on Cloud platforms. In either case, data mining and machine learning algorithms are being fully implemented on Cloud platforms under the new efficiency and performance requirements of Big Data. Among the newly developed frameworks is Apache Mahout, whose goal is "to build an environment for quickly creating scalable performant machine learning applications". In this paper we analyse the performance of some clustering algorithms of Apache Mahout using a Twitter streaming dataset on a Hadoop MapReduce cluster infrastructure according to various evaluation criteria.
