Online feature selection for high-dimensional class-imbalanced data

When tackling high dimensionality in data mining, online feature selection, which handles features that arrive one by one over time, offers clear advantages over traditional feature selection methods. However, in real-world applications such as fraud detection and medical diagnosis, the data is both high-dimensional and highly class-imbalanced, that is, some classes have far more instances than others. Under class imbalance, existing online feature selection algorithms usually neglect the minority classes, which are often the classes that matter most in these applications. Learning from high-dimensional, class-imbalanced data in an online manner is therefore a challenge. Motivated by this, we first formalize the problem of online streaming feature selection for class-imbalanced data, and then present an efficient online feature selection framework built on the dependency between condition features and decision classes. We further propose a new algorithm, Online Feature Selection based on the Dependency in K nearest neighbors (K-OFSD). Grounded in Neighborhood Rough Set theory, K-OFSD uses the information of nearest neighbors to select relevant features that yield higher separability between the majority class and the minority class. Finally, experimental studies on seven high-dimensional, class-imbalanced data sets show that our algorithm achieves better performance than traditional feature selection methods with the same numbers of selected features, as well as state-of-the-art online streaming feature selection algorithms in the online setting.
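
As a rough illustration of the kind of dependency measure the abstract describes, the sketch below estimates how strongly the decision classes depend on a candidate feature subset from each instance's k nearest neighbors, and greedily keeps a streaming feature only if that dependency improves. This is a minimal sketch, not the authors' K-OFSD: the function names (knn_dependency, online_feature_selection), the Euclidean neighborhoods, and the keep-if-improved acceptance rule are assumptions made for illustration only.

```python
import numpy as np

def knn_dependency(X_subset, y, k=5):
    """Dependency of the decision classes on the selected features:
    the fraction of instances whose k nearest neighbors (measured on
    those features) all share the instance's own class label, i.e. the
    size of the neighborhood positive region divided by |U|."""
    n = X_subset.shape[0]
    consistent = 0
    for i in range(n):
        # Euclidean distances from instance i to every other instance.
        d = np.linalg.norm(X_subset - X_subset[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        neighbors = np.argsort(d)[:k]      # its k nearest neighbors
        if np.all(y[neighbors] == y[i]):   # neighborhood is label-consistent
            consistent += 1
    return consistent / n

def online_feature_selection(feature_stream, y, k=5):
    """Greedy online selection: keep a newly arrived feature column only
    if it increases the k-NN dependency of the current subset."""
    selected, best = [], 0.0
    for feature in feature_stream:         # features arrive one at a time
        candidate = selected + [feature]
        gamma = knn_dependency(np.column_stack(candidate), y, k)
        if gamma > best:                   # feature is relevant: keep it
            selected, best = candidate, gamma
    return selected, best
```

Here feature_stream is an iterable of 1-D NumPy feature columns and y is a NumPy array of class labels; since the abstract does not specify K-OFSD's acceptance criterion, the keep-if-dependency-improves rule above is purely illustrative.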
