Multiple-cause discovery combined with structure learning for high-dimensional discrete data and application to stock prediction

Causal discovery from observational data is crucial to a variety of scientific and business research. Although many causal discovery algorithms have been proposed in recent decades, none of them is effective enough on high-dimensional discrete data. The main challenge is the complex interactions among a large number of variables, which lead to numerous spurious causal relations. In this work, we propose a novel multiple-cause discovery method combined with structure learning (McDSL) to eliminate these spurious causalities. The method proceeds in two phases. In the first phase, a conditional independence test is used to distinguish direct causal candidates from indirect ones. In the second phase, the causal direction of each multi-cause structure is determined with a hybrid causal discovery method. Validation experiments on synthetic data showed that McDSL reliably discovers multi-cause structures and eliminates indirect causes. We then applied the algorithm to discover multiple causes of stock return from 13 years of historical financial data from the Shanghai Stock Exchange of China, and built a stock prediction model. Experimental results showed that the causes discovered by McDSL reveal changes in the key risk factors of the stock market over the 13 years, which indicates that investors should adapt their investment strategies over time. Moreover, the causes discovered by McDSL predict stock return better than the features selected by other common filter-based feature selection algorithms.
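
The first phase can be illustrated with a minimal sketch of a conditional independence test on discrete data. The abstract does not say which test McDSL uses, so the G-test (likelihood-ratio test) here is an assumption, as are the function names `g_statistic` and `is_dependent`, the variable names X, Z, Y, and the toy counts, which are built so that X influences Y only through Z. The critical values are the standard chi-square values at the 0.05 level for 1 and 2 degrees of freedom.

```python
from collections import Counter
from math import log

# Chi-square critical values at significance level 0.05, keyed by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991}


def g_statistic(rows, x, y, cond=()):
    """Return (G, dof) for testing x independent of y given cond on discrete rows."""
    n_xyz, n_xz, n_yz, n_z = Counter(), Counter(), Counter(), Counter()
    for row in rows:
        z = tuple(row[c] for c in cond)  # one stratum per conditioning value
        n_xyz[(row[x], row[y], z)] += 1
        n_xz[(row[x], z)] += 1
        n_yz[(row[y], z)] += 1
        n_z[z] += 1
    # G = 2 * sum of observed * ln(observed / expected), summed over all strata.
    g = 2.0 * sum(
        n * log(n * n_z[z] / (n_xz[(xv, z)] * n_yz[(yv, z)]))
        for (xv, yv, z), n in n_xyz.items()
    )
    x_levels = len({xv for xv, _ in n_xz})
    y_levels = len({yv for yv, _ in n_yz})
    dof = (x_levels - 1) * (y_levels - 1) * len(n_z)
    return g, dof


def is_dependent(rows, x, y, cond=()):
    """True if independence of x and y given cond is rejected at the 0.05 level."""
    g, dof = g_statistic(rows, x, y, cond)
    return g > CHI2_CRIT_05[dof]


# Hypothetical toy counts for a chain X -> Z -> Y, keyed by (X, Z, Y).
counts = {
    (0, 0, 0): 36, (0, 0, 1): 4, (1, 0, 0): 9, (1, 0, 1): 1,
    (0, 1, 0): 1, (0, 1, 1): 9, (1, 1, 0): 4, (1, 1, 1): 36,
}
rows = [
    {"X": x, "Z": z, "Y": y}
    for (x, z, y), n in counts.items()
    for _ in range(n)
]

x_marginal = is_dependent(rows, "X", "Y")           # X looks like a cause of Y
x_given_z = is_dependent(rows, "X", "Y", ("Z",))    # but not once Z is known
z_given_x = is_dependent(rows, "Z", "Y", ("X",))    # Z stays dependent given X
print(x_marginal, x_given_z, z_given_x)
```

In this toy data X is associated with Y marginally, yet the association vanishes once Z is conditioned on, so X would be treated as an indirect cause and Z kept as a direct causal candidate. The test says nothing about causal direction, which the abstract assigns to the second phase.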
