Effective dimensionality of large-scale expression data using principal component analysis.

Large-scale expression data are now measured for thousands of genes simultaneously. This development has been followed by an exploration of theoretical tools for extracting as much information from these data as possible. One line of research attempts to extract the underlying regulatory network. The models used thus far, however, contain many parameters, and careful investigation is necessary to avoid over-fitting them. We employ principal component analysis to show how, in the context of linear additive models, one can obtain a rough estimate of the effective dimensionality (the number of information-carrying dimensions) of large-scale gene expression datasets. We treat both the lack of independence between measurements in a time series and the fact that measurements are subject to some level of noise. Both effects reduce the effective dimensionality and thereby constrain the complexity of the models that can be built from the data.
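As a rough illustration of the idea, and not the procedure used in the paper, the following sketch counts the principal components of an expression matrix whose variance stands clearly above what pure measurement noise would produce. Everything specific here is an assumption made for the example: the function name `effective_dimensionality`, the synthetic data, the known noise level `noise_std`, the noise floor taken from the upper edge of the Marchenko-Pastur law for an i.i.d. noise matrix, and the hypothetical `safety_factor` that absorbs finite-size fluctuations.

```python
import numpy as np


def effective_dimensionality(data, noise_std, safety_factor=1.5):
    """Count principal components whose variance exceeds a noise floor.

    data          : array of shape (n_samples, n_genes)
    noise_std     : assumed standard deviation of the measurement noise
    safety_factor : hypothetical margin above the asymptotic noise edge
    """
    n_samples, n_genes = data.shape

    # Center each gene (column) so that the SVD yields principal components.
    centered = data - data.mean(axis=0)

    # Singular values of the centered matrix give the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2 / (n_samples - 1)

    # Assumed noise floor: approximate largest eigenvalue expected from an
    # i.i.d. noise matrix of this shape (Marchenko-Pastur upper edge),
    # inflated by the safety factor.
    noise_floor = safety_factor * noise_std**2 * (1 + np.sqrt(n_genes / n_samples)) ** 2

    return int(np.sum(variances > noise_floor))


# Synthetic example: 20 measurements of 500 genes driven by 3 latent factors.
rng = np.random.default_rng(0)
n_samples, n_genes, n_factors = 20, 500, 3
factors = rng.standard_normal((n_samples, n_factors))
loadings = rng.standard_normal((n_factors, n_genes))
noise_std = 0.5
data = factors @ loadings + noise_std * rng.standard_normal((n_samples, n_genes))

estimate = effective_dimensionality(data, noise_std)
print("estimated effective dimensionality:", estimate)
```

In this construction the three latent factors carry far more variance than the noise, so the count should recover them. With real data the noise level is rarely known exactly, and successive time points are not independent, so such a count is best read as an upper bound on the model complexity the data can support.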
