Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation

We present an algorithmic framework for learning local causal structure around target variables of interest, in the form of direct causes/effects and Markov blankets, that is applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art and novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments, using real data, we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony, and compare them with other local causal discovery methods and with state-of-the-art non-causal feature selection methods. A second set of experiments, using simulated and resimulated data, compares the algorithms in terms of their ability to induce local causal neighborhoods and examines the relation between predictivity and causal induction performance. Our experiments demonstrate, consistent with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate families of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity, and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. We therefore conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show how local techniques can be used for scalable and accurate global causal graph learning.
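To make the two target quantities concrete, the following minimal sketch reads them off a causal graph that is already known. It is not GLL or any algorithm from the paper, which must induce these sets from data; it only illustrates what "direct causes/effects" and "Markov blanket" denote. The graph `EXAMPLE_DAG`, the variable names, and both helper functions are hypothetical, and the sketch assumes the standard setting in which the Markov blanket of a target in a faithful causal DAG consists of its parents, its children, and the other parents of its children.

```python
from typing import Dict, Set

# Hypothetical causal DAG, stored as node -> set of its parents (direct causes).
# Edges: B -> A, A -> T, T -> C, S -> C, C -> D.
EXAMPLE_DAG: Dict[str, Set[str]] = {
    "B": set(),
    "A": {"B"},
    "T": {"A"},
    "S": set(),
    "C": {"T", "S"},
    "D": {"C"},
}


def direct_causes_and_effects(dag: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return the parents (direct causes) and children (direct effects) of target."""
    parents = set(dag[target])
    children = {node for node, node_parents in dag.items() if target in node_parents}
    return parents | children


def markov_blanket(dag: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return parents, children, and spouses (other parents of children) of target."""
    parents = set(dag[target])
    children = {node for node, node_parents in dag.items() if target in node_parents}
    spouses = {p for child in children for p in dag[child]} - {target}
    return parents | children | spouses


if __name__ == "__main__":
    print(sorted(direct_causes_and_effects(EXAMPLE_DAG, "T")))  # ['A', 'C']
    print(sorted(markov_blanket(EXAMPLE_DAG, "T")))             # ['A', 'C', 'S']
```

In this toy graph the Markov blanket of `T` adds only the spouse `S` to the direct causes and effects, while the more distant variables `B` and `D` are excluded, which is the kind of parsimony the abstract refers to.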
