An efficient conserved region detection method for multiple protein sequences using principal component analysis and wavelet transform

This paper proposes an efficient method for detecting conserved regions in multiple protein sequences. Instead of detecting conserved regions directly from the set of all participating protein sequences, the proposed method splits the detection process into two stages. In the first stage, a series of principal component analysis (PCA) steps is applied to infer the common ancestor protein from the participating proteins based on a hypothetical evolutionary history. In the second stage, a wavelet transform is employed to derive conserved regions from the common ancestor protein. The detected conserved regions are taken as the common conserved regions of the original protein sequences. A set of experiments indicates that the two-stage strategy not only prevents the residue divergence problem but also increases detection accuracy and efficiency.
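
The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how residues are encoded numerically, how the evolutionary history orders the PCA steps, or which wavelet and decision rule are used. The sketch therefore assumes that the sequences are already aligned and encoded as equal-length lists of numbers, merges them by projecting onto the first principal component, and applies a single level of the Haar wavelet as a stand-in. The function names `first_principal_component` and `haar_step` are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Merge equal-length numeric sequences into one signal by projecting
    each aligned position onto the first principal component.

    Assumption: `rows` holds already-aligned, numerically encoded sequences.
    The sign of a principal component is arbitrary, so the merged signal is
    only defined up to a global sign.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r) / m for r in rows]
    centered = [[v - mu for v in r] for r, mu in zip(rows, means)]
    # n x n sample covariance between sequences.
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / (m - 1)
            for j in range(n)] for i in range(n)]
    # Power iteration for the dominant eigenvector. The asymmetric start
    # vector reduces the chance of starting orthogonal to that eigenvector.
    vec = [1.0 / (i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # all sequences are constant; nothing to project
            break
        vec = [x / norm for x in nxt]
    # Score of every aligned position along the first component.
    return [sum(vec[i] * centered[i][k] for i in range(n)) for k in range(m)]


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail) coefficient lists, each half as long
    as the input. The input length must be even.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


if __name__ == "__main__":
    # Toy numeric profiles standing in for two encoded, aligned proteins.
    seq_a = [1.0, 2.0, 3.0, 4.0]
    seq_b = [1.0, 2.0, 3.0, 4.0]
    ancestor = first_principal_component([seq_a, seq_b])
    approx, detail = haar_step(ancestor)
    print(ancestor, approx, detail)
```

In this sketch, repeating the merge pairwise along a guide tree would yield a single ancestor-like signal, and the Haar coefficients of that signal would then be inspected for conserved stretches. Both the merge order and the criterion for calling a region conserved are left open here because the abstract does not specify them.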
