Partitioning of functional gene expression data using principal points

Background: DNA microarrays make it possible to study variations in many genes simultaneously. Gene expression is a temporal process in which the expression levels of a gene with a characterized function vary over a period of time. Temporal gene expression curves can be treated as functional data, since they can be regarded as independent realizations of a stochastic process, and appropriate models are required to identify patterns of gene function. Partitioning functional data can identify homogeneous subgroups among the large number of genes within their inherent biological networks, and is therefore a useful technique for analyzing time-course gene expression data. We propose a new self-consistent partitioning method based on the functional coefficients of individual expression profiles in an orthonormal basis system.

Results: A principal-points-based functional partitioning method is proposed for time-course gene expression data. The method explores the relationships between genes by using Legendre coefficients as principal points to extract the features of gene functions. On simulated data, the proposed method yields high connectivity after clustering and finds significant subsets of genes with increased connectivity. Its comparative advantages are that it uses fewer coefficients from the functional data and that the principal points used for partitioning are self-consistent. As real-data applications, we partition genes using expression data from budding yeast and Escherichia coli.

Conclusions: The proposed method benefits from the use of principal points, dimension reduction, and the choice of an orthogonal basis system, and it yields appropriately connected genes in the resulting subsets. We illustrate the method by applying it to sets of cell-cycle-regulated time-course yeast genes and E. coli genes. The proposed method is able to identify highly connected genes and to explore the complex dynamics of biological systems in functional genomics.
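The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each expression profile is summarized by least-squares Legendre coefficients on time rescaled to [-1, 1], and that the principal points are estimated by k-means cluster means computed on those coefficients. The function names (`legendre_coefficients`, `kmeans`), the polynomial degree, the number of groups, and the synthetic profiles are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_coefficients(times, profiles, degree):
    """Least-squares Legendre coefficients, one row per expression profile."""
    # Rescale sampling times to [-1, 1], where Legendre polynomials are orthogonal.
    t = 2.0 * (times - times.min()) / (times.max() - times.min()) - 1.0
    # legfit fits every column of its y argument; result is (degree + 1, n_genes).
    return legendre.legfit(t, profiles.T, degree).T


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; the cluster means estimate k principal points."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Self-consistency step: each center becomes the mean of its own group.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return labels, centers


# Synthetic stand-in for time-course data: 3 curve shapes, 20 noisy genes each.
rng = np.random.default_rng(0)
times = np.linspace(0.0, 1.0, 18)
shapes = [np.sin(2 * np.pi * times), np.cos(2 * np.pi * times), 2.0 * times - 1.0]
profiles = np.vstack([s + 0.1 * rng.standard_normal((20, times.size)) for s in shapes])

coefs = legendre_coefficients(times, profiles, degree=4)  # 18 time points -> 5 features
labels, centers = kmeans(coefs, k=3)
```

Here each gene is reduced from 18 time points to 5 coefficients before partitioning, which illustrates the dimension reduction the abstract mentions. At convergence every center equals the mean of the coefficient vectors assigned to it, which is the self-consistency property in this simplified setting.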
