Predicting allergenic proteins using wavelet transform

MOTIVATION: With many transgenic proteins being introduced today, the ability to predict their potential allergenicity has become an important issue. Previous studies were based on either sequence similarity or protein motifs identified from known allergen databases. Similarity-based approaches, although able to achieve high recall, usually have low precision. Motif-based approaches have been shown to improve precision in cross-validation experiments. This study describes a system that combines the advantages of similarity-based and motif-based prediction.

RESULTS: The new prediction system uses a clustering algorithm to group known allergenic proteins into clusters. Proteins within each cluster are assumed to carry one or more common motifs. After multiple sequence alignment, the proteins in each cluster are passed through a wavelet analysis program that identifies conserved motifs. A hidden Markov model (HMM) profile is then prepared for each identified motif. Allergens that do not appear to carry detectable allergen motifs are saved in a small database. The allergenicity of an unknown protein is predicted by comparing it against the HMM profiles and, if no matching profile is found, against the small allergen database using BLASTP. Recall of over 70% and precision of over 90% were observed in cross-validation experiments. Using the entire Swiss-Prot database as the query, we predicted about 2000 potential allergens.

AVAILABILITY: The software is available upon request from the authors.
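The two-stage decision described in the abstract (motif profiles first, similarity search as a fallback) and the precision and recall figures it reports can be illustrated with a minimal sketch. This is not the authors' software: the function names, the callable-based scorers, and both thresholds are hypothetical placeholders. In a real pipeline the motif scorers would presumably wrap an HMM profile search and the similarity function would wrap a BLASTP run against the small allergen database.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical types: a motif scorer returns a match score (higher is better),
# and a similarity search returns the best E-value found (lower is better).
MotifScorer = Callable[[str], float]
SimilaritySearch = Callable[[str], float]


def predict_allergen(
    sequence: str,
    motif_scorers: Iterable[Tuple[str, MotifScorer]],
    similarity_search: SimilaritySearch,
    motif_threshold: float,
    evalue_cutoff: float,
) -> Tuple[bool, str]:
    """Return (is_predicted_allergen, evidence) for one query sequence."""
    # Stage 1: compare the query against every motif profile.
    for name, scorer in motif_scorers:
        if scorer(sequence) >= motif_threshold:
            return True, f"motif:{name}"
    # Stage 2: no profile matched, so fall back to the similarity search
    # against the allergens that carry no detectable motif.
    if similarity_search(sequence) <= evalue_cutoff:
        return True, "similarity"
    return False, "none"


def precision_recall(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Compute precision and recall from paired boolean labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Passing the scorers in as callables keeps the cascade logic separate from whichever profile-search and alignment tools are actually used, so the thresholds can be tuned during cross-validation without changing the control flow.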
