Integrated Theory- and Data-Driven Feature Selection in Gene Expression Data Analysis

The exponential growth of high-dimensional biological data has rapidly increased the demand for automated approaches to knowledge production. Existing methods rely on two general approaches to address this challenge: 1) the theory-driven approach, which utilizes previously accumulated knowledge, and 2) the data-driven approach, which relies solely on the data to deduce scientific knowledge. Used alone, each approach is biased toward either past or present knowledge, because neither incorporates all of the knowledge currently available for making new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis approaches. We realize our approach in a novel two-step analytical workflow: the first step applies a new feature selection paradigm to high-throughput gene expression data, and the second step uses graphical causal modeling to automatically extract causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method is capable of intelligently selecting genes for learning effective causal networks.
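As a rough illustration of what combining theory-driven and data-driven selection can look like, the sketch below keeps genes named in a prior-knowledge list and then adds the most variable remaining genes from the expression data. This is not the paper's algorithm, which the abstract does not specify: the function name `select_genes`, the variance ranking, and the toy gene names and values are all assumptions made for illustration.

```python
from statistics import pvariance


def select_genes(expression, prior_genes, k):
    """Hypothetical sketch: prior-knowledge genes plus the k most variable others.

    expression: dict mapping gene name -> list of expression values across samples.
    prior_genes: gene names taken from prior knowledge (assumed input).
    k: number of extra genes to add using the data alone.
    """
    # Theory-driven part: keep prior-knowledge genes that were actually measured,
    # dropping duplicates while preserving their order.
    theory = [g for g in dict.fromkeys(prior_genes) if g in expression]
    chosen = set(theory)
    # Data-driven part: rank the remaining genes by population variance
    # (variance is an assumed stand-in for any data-driven score).
    remaining = [g for g in expression if g not in chosen]
    ranked = sorted(remaining, key=lambda g: pvariance(expression[g]), reverse=True)
    return theory + ranked[:k]


# Toy data, invented for illustration only.
toy_expression = {
    "BRCA1": [1.0, 1.1, 0.9],
    "TP53": [2.0, 2.0, 2.0],
    "GENE_A": [0.0, 5.0, 10.0],
    "GENE_B": [1.0, 1.0, 1.2],
}
selected = select_genes(toy_expression, ["BRCA1", "NOT_MEASURED"], k=1)
print(selected)
```

In a workflow of the kind the abstract describes, the reduced gene set would then be passed to a causal graph learning step; that step is omitted here because the abstract gives no detail about it.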
