The Construction and Assessment of a Statistical Model for the Prediction of Protein Assay Data

This work develops a statistical model for a bioinformatics database whose distinctive structure makes model assessment an interesting and challenging problem. The key components of the statistical methodology, including a fast approximation to the singular value decomposition, adaptive spline modeling, and tree-based methods, are described, and preliminary results are presented. These results compare favorably with selected results obtained using comparative methods. An attempt to determine the predictive ability of the model through cross-validation experiments is discussed. In conclusion, a synopsis of the results of these experiments and their implications for the analysis of bioinformatics databases in general is presented.
