Swarm intelligence based online feature selection (OFS) and weighted entropy frequent pattern mining (WEFPM) algorithm for big data analysis

During the past two decades, frequent pattern mining (FPM) has attracted the interest of many researchers. FPM involves extracting frequently occurring itemsets and sequences from large transaction datasets, and identifying common subgraphs in structures such as molecules. In the big data era, the unpredictable flow and sheer volume of data pose new challenges for FPM, notably in space and time complexity. Most existing work focuses on recognizing frequently occurring patterns in a fixed dataset, where the patterns within every transaction are known a priori; yet users are typically interested in only a small fraction of these frequent patterns. To reduce complexity in such scenarios, it is often necessary to select only the important features, using an appropriate FPM algorithm. The main objectives of this work are to improve FPM results and to improve classification accuracy on big data samples. To tackle the first challenge, a Lévy flight bat algorithm (LFBA) combined with an online feature selection (OFS) approach is proposed, which filters low-quality features from the big data stream in an online manner. To address the second challenge, weighted entropy frequent pattern mining (WEFPM) is applied, achieving better computation time than methods such as direct discriminative pattern mining (DDPMine) and iterative sampling based frequent itemset mining (ISbFIM), which enumerate all feature combinations. The WEFPM algorithm thus aims to identify only the specific frequent patterns required by the user. By iterating this procedure, both theoretical and empirical analysis confirm that the resulting frequent patterns can be enumerated without running into a combinatorial explosion.
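The Lévy flight steps that drive the LFBA are commonly generated with Mantegna's algorithm (reference [26] below). The sketch that follows shows that standard step generator together with a hypothetical position update toward the current best solution; the exact LFBA update rule is not given in this abstract, so `bat_position_update` and its `alpha` scaling are illustrative assumptions only.

```python
import math
import random

def levy_step(beta=1.5):
    """Draw one Levy-flight step via Mantegna's algorithm (beta around 1.5)."""
    # Standard deviation of the numerator Gaussian, per Mantegna (1994).
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = random.gauss(0.0, sigma_u)   # heavy-tailed numerator
    v = random.gauss(0.0, 1.0)       # standard-normal denominator
    return u / abs(v) ** (1 / beta)

def bat_position_update(position, best, alpha=0.01):
    """Hypothetical bat move: drift toward the best-known solution along a
    Levy step. This is an illustrative rule, not the paper's exact LFBA update."""
    return [x + alpha * levy_step() * (b - x) for x, b in zip(position, best)]
```

In feature selection, each position component would score one candidate feature, and the heavy-tailed steps let the search occasionally make long jumps out of local optima.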
Moreover, using the LFBA–OFS approach together with the WEFPM algorithm, frequent patterns of diverse nature are generated to build a high-quality learning model. To find the frequent patterns, the minimum support threshold is matched against an entropy measure. Finally, a multiple kernel learning support vector machine is employed as the classifier to evaluate the efficiency and accuracy of the approach on big data samples. Empirical results show that the proposed approach achieves considerable improvements in accuracy and computation time on UCI benchmark big datasets, for efficient and effective FPM over online features. WEFPM is the most efficient method, producing average accuracies of 92.34%, 93.218%, 91.374% and 87.87% on the adult, chess, hypo and sick datasets respectively, outperforming DDPMine and ISbFIM under a LIBSVM classifier.
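Since the abstract does not specify the exact weighted-entropy formulation, the following sketch only illustrates the general idea it describes: candidate patterns are retained when both a minimum support threshold and an entropy-based score are met, so enumeration stays restricted to user-relevant patterns rather than all feature combinations. The per-item binary-entropy weighting, the `weights` dictionary, and all function names are illustrative assumptions, not the paper's definitions.

```python
import math
from itertools import combinations

def pattern_support(transactions, pattern):
    """Fraction of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t) / len(transactions)

def weighted_entropy(transactions, pattern, weights):
    """Assumed scoring: average user-weighted binary entropy of each
    item's support. A high score marks informative (non-trivial) items."""
    score = 0.0
    for item in pattern:
        p = pattern_support(transactions, {item})
        if 0.0 < p < 1.0:
            h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
            score += weights.get(item, 1.0) * h
    return score / len(pattern)

def mine_patterns(transactions, weights, min_support=0.4,
                  min_entropy=0.5, max_len=2):
    """Keep only patterns meeting BOTH the support and the weighted-entropy
    thresholds, instead of enumerating every frequent combination."""
    items = sorted({i for t in transactions for i in t})
    result = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            p = set(combo)
            if (pattern_support(transactions, p) >= min_support
                    and weighted_entropy(transactions, p, weights) >= min_entropy):
                result.append(frozenset(p))
    return result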

[1] Hui Cao, et al. Approximate RBF Kernel SVM and Its Applications in Pedestrian Classification, 2008.

[2] Philip S. Yu, et al. Direct Discriminative Pattern Mining for Effective Classification, 2008, IEEE 24th International Conference on Data Engineering.

[3] Rong Gu, et al. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark, 2014, IEEE International Parallel & Distributed Processing Symposium Workshops.

[4] Takashi Washio, et al. State of the art of graph-based data mining, 2003, SKDD.

[5] Alexander J. Smola, et al. Learning with kernels, 1998.

[6] Nello Cristianini, et al. A statistical framework for genomic data fusion, 2004, Bioinform.

[7] Zhe Chen, et al. An Overview of Bayesian Methods for Neural Spike Train Analysis, 2013, Comput. Intell. Neurosci.

[8] Philip S. Yu, et al. Graph indexing: a frequent structure-based approach, 2004, SIGMOD '04.

[9] Jiawei Han, et al. Discriminative Frequent Pattern Analysis for Effective Classification, 2007, IEEE 23rd International Conference on Data Engineering.

[10] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[11] Anthony K. H. Tung, et al. Mining top-K covering rule groups for gene expression data, 2005, SIGMOD '05.

[12] Shigeo Abe. Pattern Classification, 2001, Springer London.

[13] David G. Stork, et al. Pattern Classification, 1973.

[14] Xin-She Yang, et al. Eagle Strategy Using Lévy Walk and Firefly Algorithms for Stochastic Optimization, 2010, NICSO.

[15] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[16] Christian Borgelt, et al. Induction of Association Rules: Apriori Implementation, 2002, COMPSTAT.

[17] Srinivasan Parthasarathy, et al. Parallel Algorithms for Discovery of Association Rules, 1997, Data Mining and Knowledge Discovery.

[18] Anthony K. H. Tung, et al. Carpenter: finding closed patterns in long biological datasets, 2003, KDD '03.

[19] Mark Oskin, et al. Quantum computing, 2008, CACM.

[20] Rong Jin, et al. Online Feature Selection and Its Applications, 2014, IEEE Transactions on Knowledge and Data Engineering.

[21] Mahdi Hasanlou, et al. A comparison study of different kernel functions for SVM-based classification of multi-temporal polarimetry SAR data, 2014.

[22] Magda B. Fayek, et al. Frequent Itemset Mining for Big Data Using Greatest Common Divisor Technique, 2017, Data Sci. J.

[23] Philip S. Yu, et al. Direct mining of discriminative and essential frequent patterns via model-based search tree, 2008, KDD.

[24] Thomas Hofmann, et al. Support vector machine learning for interdependent and structured output spaces, 2004, ICML.

[25] Kun Zhang, et al. Iterative sampling based frequent itemset mining for big data, 2015, Int. J. Mach. Learn. Cybern.

[26] R. Mantegna, et al. Fast, accurate algorithm for numerical simulation of Lévy stable stochastic processes, 1994, Physical Review E.

[27] Selim Yilmaz, et al. A new modification approach on bat algorithm for solving optimization problems, 2015, Appl. Soft Comput.

[28] Sabeur Aridhi, et al. Density-based data partitioning strategy to approximate large-scale subgraph mining, 2012, Inf. Syst.

[29] Ming-Yen Lin, et al. Apriori-based frequent itemset mining algorithms on MapReduce, 2012, ICUIMC.

[30] Edward Y. Chang, et al. PFP: parallel FP-growth for query recommendation, 2008, RecSys '08.

[31] Anthony K. H. Tung, et al. COBBLER: combining column and row enumeration for closed pattern discovery, 2004, Proceedings of the 16th International Conference on Scientific and Statistical Database Management.

[32] Mohammed J. Zaki, et al. CHARM: An Efficient Algorithm for Closed Itemset Mining, 2002, SDM.

[33] Masaru Kitsuregawa, et al. Parallel mining algorithms for generalized association rules with classification hierarchy, 1997, SIGMOD '98.

[34] Ada Wai-Chee Fu, et al. Mining association rules with weighted items, 1998, Proceedings of IDEAS '98, International Database Engineering and Applications Symposium.

[35] Qiming Chen, et al. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, 2001, Proceedings of the 17th International Conference on Data Engineering.

[36] John L. Faundeen, et al. Developing Criteria to Establish Trusted Digital Repositories, 2017, Data Sci. J.

[37] Jian Xie, et al. A Novel Bat Algorithm Based on Differential Operator and Lévy Flights Trajectory, 2013, Comput. Intell. Neurosci.

[38] Guizhen Yang, et al. Computational aspects of mining maximal frequent patterns, 2006, Theor. Comput. Sci.

[39] Rong Jin, et al. Online feature selection for mining big data, 2012, BigMine '12.

[40] O. Hasançebi, et al. A bat-inspired algorithm for structural optimization, 2013.