Identification of protein coding regions in the human genome by quadratic discriminant analysis.

A new method for predicting internal coding exons in genomic DNA sequences has been developed. The prediction algorithm applies the quadratic discriminant function from multivariate statistical pattern recognition. Using only 9 discriminant variables, it substantially improves on existing methods: HEXON [Solovyev, V. V., Salamov, A. A. & Lawrence, C. B. (1994) Nucleic Acids Res. 22, 5156-5163] (based on linear discriminant analysis) and GRAIL2 [Uberbacher, E. C. & Mural, R. J. (1991) Proc. Natl. Acad. Sci. USA 88, 11261-11265] (based on neural networks). A computer program called MZEF is freely available to the genome community; it allows users to adjust the prior probability and to output alternative overlapping exons.
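
To make the idea of a quadratic discriminant function with an adjustable prior concrete, here is a minimal sketch in Python. It is not the MZEF implementation. It assumes the textbook form of quadratic discriminant analysis, in which each class (exon or pseudo-exon) is modelled by its own Gaussian mean and covariance, and it uses three synthetic features in place of the nine discriminant variables, which the abstract does not list. All function names, the feature values, and the `prior_exon` parameter are hypothetical.

```python
import numpy as np

def fit_class(X):
    """Estimate the mean vector and covariance matrix of one class.
    Rows of X are training samples, columns are discriminant variables."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_gaussian(x, mu, cov):
    """Log density of a multivariate Gaussian; quadratic in x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)  # squared Mahalanobis distance
    return -0.5 * (maha + logdet + len(mu) * np.log(2 * np.pi))

def exon_posterior(x, exon_params, pseudo_params, prior_exon=0.5):
    """Posterior probability that x is an exon, given a user-chosen prior."""
    log_e = log_gaussian(x, *exon_params) + np.log(prior_exon)
    log_p = log_gaussian(x, *pseudo_params) + np.log(1.0 - prior_exon)
    m = max(log_e, log_p)  # subtract the max for numerical stability
    e, p = np.exp(log_e - m), np.exp(log_p - m)
    return e / (e + p)

# Synthetic training data (assumption: purely illustrative, not real exons).
rng = np.random.default_rng(0)
exon_train = rng.normal(loc=1.0, scale=1.0, size=(200, 3))
pseudo_train = rng.normal(loc=-1.0, scale=2.0, size=(200, 3))

exon_params = fit_class(exon_train)
pseudo_params = fit_class(pseudo_train)

candidate = np.array([1.0, 1.0, 1.0])
p_default = exon_posterior(candidate, exon_params, pseudo_params)
p_strict = exon_posterior(candidate, exon_params, pseudo_params, prior_exon=0.01)
print(f"posterior with prior 0.5:  {p_default:.3f}")
print(f"posterior with prior 0.01: {p_strict:.3f}")
```

Because the two classes keep separate covariance matrices, the decision boundary is quadratic rather than linear, which is the distinction the abstract draws against linear discriminant analysis. Lowering `prior_exon` lowers the posterior for every candidate, so a user can trade sensitivity for specificity without retraining.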
