Online Streaming Feature Selection via Conditional Independence

Online feature selection is a challenging topic in data mining. It aims to reduce the dimensionality of streaming features by removing irrelevant and redundant features in real time. Existing methods, such as Alpha-investing and Online Streaming Feature Selection (OSFS), have been proposed for this purpose, but they suffer from low prediction accuracy and high running time when the streaming features exhibit low redundancy and high relevance. In this paper, we propose a novel online streaming feature selection algorithm, named ConInd, that uses a three-layer filtering strategy to process streaming features and overcome these drawbacks. Through the three filtering layers, i.e., null-conditional independence, single-conditional independence, and multi-conditional independence, we obtain an approximate Markov blanket with high accuracy and low running time. To validate its efficiency, we implemented the proposed algorithm and tested its performance on prevalent datasets, i.e., NIPS 2003 and Causality Workbench. Extensive experimental results demonstrate that ConInd offers significant improvements in prediction accuracy and running time compared to Alpha-investing and OSFS. When the dataset has low redundancy and high relevance, ConInd offers 5.62% higher average prediction accuracy than Alpha-investing and a 53.56% lower average running time than OSFS. In addition, the ratio of the average number of features for ConInd is 242% less than that for Alpha-investing.
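
The three filtering layers named above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the independence test, the thresholds, or how the layers interact, so everything below is an assumption made for illustration. In particular, the Fisher z-test on partial correlations (suitable for roughly Gaussian data), the helper names `partial_corr`, `is_independent`, and `three_layer_filter`, and the parameters `alpha` and `max_cond` are all hypothetical.

```python
import math
from itertools import combinations

import numpy as np


def partial_corr(data, i, j, cond=()):
    """Sample partial correlation of columns i and j given the columns in cond."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    if not cond:
        return corr[0, 1]
    # Partial correlation read off the (pseudo-)inverse of the correlation matrix.
    prec = np.linalg.pinv(corr)
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])


def is_independent(data, i, j, cond=(), alpha=0.01):
    """Fisher z-test (an assumed choice): True if i and j look independent given cond."""
    n = data.shape[0]
    r = max(min(partial_corr(data, i, j, cond), 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    p_value = math.erfc(stat / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha


def three_layer_filter(data, target, alpha=0.01, max_cond=3):
    """Process feature columns one at a time, as if they arrived as a stream."""
    selected = []
    for f in range(data.shape[1]):
        if f == target:
            continue
        # Layer 1 (null-conditional): drop features marginally independent of the target.
        if is_independent(data, f, target, (), alpha):
            continue
        # Layer 2 (single-conditional): drop f if one selected feature already explains it.
        if any(is_independent(data, f, target, (s,), alpha) for s in selected):
            continue
        selected.append(f)
        # Layer 3 (multi-conditional): re-check each selected feature against
        # subsets of the other selected features, up to size max_cond.
        for x in list(selected):
            others = [s for s in selected if s != x]
            sizes = range(1, min(max_cond, len(others)) + 1)
            if any(
                is_independent(data, x, target, subset, alpha)
                for k in sizes
                for subset in combinations(others, k)
            ):
                selected.remove(x)
    return selected
```

In this sketch the cheap marginal test runs first and the subset search runs only when a feature is actually added, which is one plausible reading of how layered filtering could reduce running time; the ordering and the cap on conditioning-set size are design assumptions, not details taken from the paper.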
