Simple integrative preprocessing preserves what is shared in data sources

Background

The bioinformatics data analysis toolbox needs general-purpose, fast, and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noise or otherwise uninteresting. Principal component analysis (PCA) of all sources combined is an obvious choice when it is not important to distinguish between source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise", but it produces a separate set of components for each source.

Results

The components given by CCA can be combined easily to produce a linear, and hence fast and easily interpretable, feature extraction method. The method fuses several sources so that the properties they share are preserved, while source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package is available at http://www.cis.hut.fi/projects/mi/software/drCCA/.

Conclusion

We introduced a method for data fusion in exploratory data analysis, for settings where statistical dependencies between the sources, and not within a source, are of interest. The method uses canonical correlation analysis in a new way for dimensionality reduction, and inherits its good properties of being simple, fast, and easily interpretable as a linear projection.
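The abstract does not spell out how the CCA components are combined, so the following is only a minimal sketch under stated assumptions, not the drCCA package's implementation or API. It assumes one common formulation of generalized CCA: whiten each source separately (which removes within-source covariance structure), concatenate the whitened sources, and run ordinary PCA on the result. The function names `whiten` and `fuse_sources` and the synthetic data are hypothetical and chosen for illustration.

```python
import numpy as np


def whiten(X, eps=1e-8):
    """Center X (samples x features) and transform it to identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > eps * evals.max()           # drop numerically null directions
    W = evecs[:, keep] / np.sqrt(evals[keep])  # scale each eigenvector column
    return Xc @ W


def fuse_sources(sources, n_components):
    """Hypothetical helper: PCA on the concatenation of whitened sources.

    Assumption: this whitening-then-PCA route is used as a stand-in for
    generalized CCA; it is not taken from the drCCA software.
    """
    Z = np.hstack([whiten(X) for X in sources])   # columns are already centered
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = s ** 2 / (Z.shape[0] - 1)           # variances along each component
    fused = Z @ Vt[:n_components].T               # linear projection of the data
    return fused, eigvals


# Synthetic illustration: two sources share one latent signal z,
# and each also carries its own unrelated variation.
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
X1 = np.column_stack([
    z + 0.1 * rng.standard_normal(n),   # shared signal
    3.0 * rng.standard_normal(n),       # large source-specific variation
    rng.standard_normal(n),
])
X2 = np.column_stack([
    2.0 * rng.standard_normal(n),       # source-specific variation
    -z + 0.1 * rng.standard_normal(n),  # shared signal, opposite sign
])

fused, eigvals = fuse_sources([X1, X2], n_components=1)
```

In this formulation with two sources, each eigenvalue equals one plus or minus a canonical correlation, so components with eigenvalues well above 1 capture variation shared between the sources, and the high-variance but source-specific column of `X1` does not dominate the fused representation as it would in plain PCA of the raw concatenation.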
