Research on Attribute Dimension Partition Based on SVM Classifying and MapReduce

Abstract Data analysis depends closely on the attribute dimensions of the data. Traditional extraction and partitioning of attribute dimensions is manual and too inefficient to meet the needs of big data analysis. This paper proposes an attribute dimension partition scheme based on SVM classification and MapReduce for big data analysis. The scheme improves the traditional SVM classification method by combining it with Euclidean distance to overcome its disadvantages, and adopts a penalty coefficient to reduce the effect of imbalanced data distribution. With the improved SVM classification method, attribute dimension partitioning uses the Hadoop MapReduce model as the processing engine, stores the extracted attribute dimensions as TF-IDF vectors, and uses the k-means clustering algorithm to partition them. Experimental results show that the proposed method improves execution efficiency and that, while the rationality of the partition is preserved, increasing the number of data attributes does not significantly increase execution time.
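The TF-IDF and k-means steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it runs in a single process rather than on Hadoop, it omits the improved SVM classifier entirely, and the function names (`tf_idf`, `map_assign`, `reduce_update`), the sample records, and the particular TF-IDF weighting (length-normalized term frequency times log(N/df)) are all assumptions chosen for illustration. The assignment and update steps of k-means are written as a map function and a reduce function only to suggest how they might be distributed.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return (vocab, vectors): one TF-IDF vector per tokenized record."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([(counts[t] / len(d)) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_assign(vectors, centroids):
    """Map step: emit (index of nearest centroid, vector) pairs."""
    return [
        (min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k])), v)
        for v in vectors
    ]

def reduce_update(pairs, centroids):
    """Reduce step: average the vectors assigned to each centroid."""
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    updated = []
    for k, old in enumerate(centroids):
        members = groups.get(k)
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(old)  # keep the centroid of an empty cluster
    return updated

def kmeans(vectors, init_indices, iterations=10):
    """Plain k-means; initial centroids are copies of the chosen vectors."""
    centroids = [list(vectors[i]) for i in init_indices]
    for _ in range(iterations):
        updated = reduce_update(map_assign(vectors, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    labels = [k for k, _ in map_assign(vectors, centroids)]
    return labels, centroids

# Hypothetical attribute-name records, already tokenized.
records = [
    ["price", "cost", "price"],
    ["cost", "price", "discount"],
    ["city", "region", "city"],
    ["region", "city", "country"],
]
vocab, vectors = tf_idf(records)
labels, _ = kmeans(vectors, init_indices=[0, 2])
print(labels)
```

In this toy input the two pricing records and the two location records share no terms, so they fall into separate clusters. Fixing the initial centroids by index keeps the run deterministic; a real deployment would need an initialization strategy and a convergence tolerance suited to floating-point data.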
