Massively-Parallel Feature Selection for Big Data

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing p-values of conditional independence tests combined through meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions: Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, and shows that PFBP dominates other competitive algorithms in its class.
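To make the meta-analytic idea concrete, the following is a minimal Python sketch, not the authors' implementation: each data partition computes a local p-value for a conditional independence test, the local p-values are combined with Fisher's method so that only p-values (rather than raw data) cross partition boundaries, and an Early Dropping step discards features whose combined p-value is not significant. The function names (fisher_combine, early_drop) and the usage values are hypothetical illustrations of the technique.

import numpy as np
from scipy.stats import chi2

def fisher_combine(p_values):
    """Combine independent per-partition p-values with Fisher's method.
    Under the null of (conditional) independence in every partition,
    -2 * sum(log p_i) follows a chi-squared law with 2k degrees of freedom."""
    p = np.clip(np.asarray(p_values, dtype=float), 1e-300, 1.0)
    stat = -2.0 * np.log(p).sum()
    return chi2.sf(stat, df=2 * len(p))

def early_drop(candidate_pvalues, alpha=0.05):
    """Early Dropping sketch: remove from further consideration every
    feature whose combined p-value exceeds the significance threshold,
    i.e., features judged conditionally independent of the target
    given the currently selected set."""
    return {f: p for f, p in candidate_pvalues.items() if p <= alpha}

# Hypothetical usage: three partitions tested candidate feature "X3" locally.
combined = fisher_combine([0.04, 0.01, 0.20])   # combined p-value across partitions
remaining = early_drop({"X3": combined, "X7": 0.60}, alpha=0.05)  # "X7" is dropped

Because only the scalar p-values are exchanged, communication per feature and per iteration stays constant regardless of the number of samples in each partition, which is what enables the scalability claimed above.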
