Big data mining using supervised machine learning approaches for Hadoop with Weka distribution

Data volumes are growing rapidly as technology advances. Processing this data and mining it accurately enough to draw conclusions is a challenge, and the field that addresses it is known as big data mining. Many open source tools for storing and processing big data have been proposed, several of them hosted by the Apache Software Foundation. Apache Hadoop is the most widely used tool for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores the data in distributed form, and MapReduce processes the distributed data. Many data mining and classification approaches have been proposed for big data, but they do not use a standard tool to implement machine learning, they do not propose a generic data-flow topology for such a model, and their classification accuracy on raw datasets is poor. In this dissertation, Apache Hadoop and Weka are used to perform big data mining. Weka is an open source machine learning tool developed at the University of Waikato, New Zealand. In this work, Apache Hadoop is connected with Weka: big data is stored on HDFS and processed in Weka through its Knowledge Flow interface. Knowledge Flow provides a means to construct topologies in which HDFS components supply data to the machine learning algorithms available in Weka. The supervised machine learning approaches used for big data mining in this work are Naïve Bayes, support vector machine, and J48. The accuracy of these approaches is compared for raw data and normalized data fed to the same topology. The proposed approach for big data mining is found to yield better results than the reference approach.
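
The comparison between raw and normalized input can be illustrated with a minimal sketch. The abstract does not state which normalization was applied, so the example assumes min-max scaling of each numeric attribute to the range [0, 1]. The function names `min_max_normalize` and `accuracy` are hypothetical, introduced only for illustration, and are not part of Weka or Hadoop.

```python
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale every numeric column to [0, 1] (assumed normalization scheme).

    A column whose values are all equal is mapped to 0.0 to avoid
    dividing by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of instances whose predicted class matches the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

In such a setup, the same classifier would be trained once on the raw rows and once on the output of `min_max_normalize`, and the two `accuracy` values would then be compared.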
