Enriched Over_Sampling Techniques for Improving Classification of Imbalanced Big Data

Big Data, generated at a rate of exabytes per year, has become a watchword of today's research. Such data lie far beyond the capability of commonly used software tools and beyond what a single-machine architecture can handle. Facing this challenge has created a need to reexamine data management options. The new avenues of NoSQL Big Data, compared with traditional forms, call for adapted experimental test beds that help discover large unknown values in enormous data sets. Outmoded management systems and statistical packages also have trouble handling Big Data. In numerous real applications, handling imbalanced data sets is a matter of priority. Classifying data sets with an imbalanced class distribution causes a notable drop in the performance of most standard classifier learning algorithms. Assuming a balanced class distribution and equal misclassification costs leads to poor results. In real-world domains, classification methods for the multi-class imbalance problem need more attention than those for the two-class problem. A methodology is presented for binary and multi-class imbalanced data sets that uses improved over_sampling (O.S.) techniques to enhance classification. The methods fall into two broad categories, non-clustered and cluster-based, and are advanced relative to prior work on O.S. techniques. The balanced data are subsequently analyzed for classification using various classifiers. The proposed techniques are implemented in a MapReduce environment on Apache Hadoop, using various data sets from the UCI and KEEL repositories. F-measure and ROC area are used to measure classification performance.
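As a rough illustration of the two ideas named above, the following is a minimal sketch of generic SMOTE-style over-sampling (interpolating between a minority sample and one of its nearest minority neighbours) together with the F-measure. It is not the enriched non-clustered or cluster-based technique proposed here, and it runs on a single machine rather than on MapReduce. The function names `oversample_minority` and `f_measure`, the parameter `k`, and the toy data are assumptions made for illustration only.

```python
import random


def oversample_minority(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by SMOTE-style interpolation.

    Illustrative only: a generic sketch, not the paper's proposed method.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1), from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy minority class: the four corners of the unit square.
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in oversample_minority(minority, n_new=3):
        print(point)
    print(f_measure(tp=8, fp=2, fn=2))
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies, which is the usual motivation for interpolation over plain duplication.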
