A greedy feature selection algorithm for Big Data of high dimensionality

We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. Using p-values of conditional independence tests together with meta-analysis techniques, PFBP relies only on computations local to a partition and minimizes communication costs, which allows the computations to be massively parallelized. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions: Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to both the number of features and the number of processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy FS algorithms. An application to simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
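
To illustrate how p-values computed locally on row partitions might be combined by a meta-analysis technique, the following is a minimal Python sketch. It assumes Fisher's combined probability test as the combination rule and independent row partitions; the abstract does not state which rule PFBP uses, and the names `fisher_combine` and `early_drop`, as well as the threshold `alpha`, are hypothetical and chosen only for illustration.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed rule).

    The statistic -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom under the null, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    k = len(p_values)
    tiny = 1e-300  # guard against log(0)
    half_stat = -sum(math.log(max(p, tiny)) for p in p_values)
    # Chi-square survival function for even dof 2k has a closed form:
    # exp(-h) * sum_{i=0}^{k-1} h^i / i!, with h = statistic / 2.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def early_drop(local_p_values, alpha=0.05):
    """Keep only features whose combined p-value is at most alpha.

    local_p_values maps a feature name to the list of p-values obtained
    on each row partition (hypothetical data layout).
    """
    return {
        feature: fisher_combine(ps)
        for feature, ps in local_p_values.items()
        if fisher_combine(ps) <= alpha
    }


if __name__ == "__main__":
    # Made-up local p-values for two features over three row partitions.
    local = {"f1": [0.01, 0.04, 0.03], "f2": [0.60, 0.45, 0.80]}
    print(early_drop(local))
```

In this sketch each partition communicates only one p-value per feature, which is the kind of low-communication exchange the abstract describes. The closed-form survival function is not numerically hardened for extremely small p-values combined over very many partitions.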
