ClearF: a supervised feature scoring method to find biomarkers using class-wise embedding and reconstruction

Feature selection and scoring methods are essential for biomarker detection in bioinformatics. Many such methods have been developed, and several studies have adopted information-theoretic approaches. However, most of these methods require long processing times. In addition, information-theoretic methods discretize continuous features, which can lead to loss of information. In this paper, we propose a novel supervised feature scoring method named ClearF. The method operates directly on continuous-valued data, follows a principle similar to that of feature selection based on mutual information, and has the added advantage of reduced computation time. The score calculation is motivated by the association between reconstruction error and information-theoretic measures, and it is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given a multi-class dataset, such as a case-control study dataset, low-dimensional embedding is first applied to each class separately, to obtain a compressed representation of that class, and then to the entire dataset. Reconstruction is then performed to calculate the error of each feature, and the final score of each feature is defined in terms of these reconstruction errors. The correlation between the information-theoretic measure and the proposed score is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with that of various algorithms on benchmark datasets. The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, in an experiment on the TCGA breast cancer dataset, the genes with the highest scores were confirmed to be highly associated with breast cancer subtypes.
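
The class-wise embedding and reconstruction step described above can be illustrated with a minimal sketch. It assumes linear PCA as the low-dimensional embedding and, by analogy with mutual information (entropy minus conditional entropy), combines the errors as whole-data error minus the summed class-wise errors. Both choices are assumptions made for illustration: the abstract states only that the score is defined in terms of the reconstruction errors. The function names `pca_reconstruction_error` and `classwise_reconstruction_score` are hypothetical.

```python
import numpy as np


def pca_reconstruction_error(X, n_components):
    """Per-feature squared reconstruction error after a PCA embedding.

    X is an (n_samples, n_features) array. PCA is an assumed choice of
    embedding, not necessarily the one used by ClearF.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    # Project onto the leading components, then map back to feature space.
    X_hat = Xc @ V @ V.T + mean
    return ((X - X_hat) ** 2).sum(axis=0)


def classwise_reconstruction_score(X, y, n_components=1):
    """Assumed scoring rule: whole-data error minus summed class-wise errors."""
    whole = pca_reconstruction_error(X, n_components)
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        # Embed and reconstruct each class separately.
        within += pca_reconstruction_error(X[y == label], n_components)
    return whole - within


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))        # synthetic continuous features
    y = np.array([0] * 20 + [1] * 20)   # two classes, e.g. case and control
    print(classwise_reconstruction_score(X, y, n_components=2))
```

Because the embedding is fitted once per class and once on the whole dataset, the cost in this sketch is a small number of singular value decompositions, and no discretization of the continuous features is needed.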
