Active Learning with Clustering for Mining Big Data

Big data mining has become a key research issue. Extracting knowledge from big data is costly and time-consuming. Because a big dataset can contain millions of data points, it is very difficult to build a learning model with standard machine learning and data mining algorithms. The main problem is fitting the whole dataset into computer memory, which is generally infeasible. Therefore, we need more scalable, robust, and adaptive learning algorithms. The existing mining algorithms are designed to handle relatively small datasets with a fixed number of class labels. In this paper, we propose a new method that uses clustering techniques to select a small number of training instances, which we regard as informative, from a large dataset. We apply the proposed method in an active learning process for classifying big data. Active learning is a supervised machine learning process in which an oracle is asked to label unlabelled training instances. Labelling a large number of unlabelled instances is a challenging and difficult task for a human expert. Therefore, finding informative unlabelled training instances is necessary for learning from big semi-supervised data. We collected six benchmark datasets from the UCI Machine Learning Repository and tested the proposed method using the following machine learning algorithms: naïve Bayes (NB) classifier, decision tree (DT) classifiers (i.e., C4.5 and CART), support vector machines (SVM), random forest, bagging, and boosting (AdaBoost). This work is devoted to our mother and father.
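The abstract does not state which clustering algorithm or selection criterion the method uses, so the following is only a minimal sketch of the general idea under stated assumptions: a plain k-means groups the unlabelled instances, and the instance closest to each centroid is sent to the oracle for labelling. The function names `kmeans` and `select_queries`, the nearest-to-centroid rule, and the toy data are all hypothetical illustrations, not the authors' implementation.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Group point indices by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for idx, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        clusters[nearest].append(idx)
    return clusters


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumed choice; the paper may use another technique)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    dim = len(points[0])
    for _ in range(iters):
        clusters = assign(points, centroids)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)
                )
    return centroids, assign(points, centroids)


def select_queries(points, k, per_cluster=1, seed=0):
    """Return indices of the instances closest to each cluster centroid.

    These are the candidates that would be passed to the oracle for labels.
    """
    centroids, clusters = kmeans(points, k, seed=seed)
    selected = []
    for centroid, members in zip(centroids, clusters):
        ranked = sorted(members, key=lambda i: dist(points[i], centroid))
        selected.extend(ranked[:per_cluster])
    return selected


# Toy unlabelled pool: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
selected = select_queries(points, k=2)
print(sorted(selected))
```

On this toy pool the sketch queries one instance per group instead of all six, which is the kind of labelling saving the abstract describes. Other criteria, such as picking instances near cluster boundaries, would fit the same structure.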
