Identifying biologically relevant genes via multiple heterogeneous data sources

Selecting genes that are differentially expressed and critical to a particular biological process has been a major challenge in post-array analysis. Recent developments in bioinformatics have made a variety of data sources available, such as mRNA and miRNA expression profiles, biological pathways, and gene annotations. Efficient and effective integration of multiple data sources enriches our knowledge of the samples and genes involved, which helps in selecting genes that bear significant biological relevance. In this work, we studied a novel problem of multi-source gene selection: given multiple heterogeneous data sources (or data sets), select genes from expression profiles by integrating information from the various data sources. We investigated how to effectively employ the information contained in multiple data sources to extract an intrinsic global geometric pattern and use it in covariance analysis for gene selection. We designed and conducted experiments to systematically compare the proposed approach with representative methods in terms of statistical and biological significance, and the results showed the efficacy and potential of the proposed approach.
