Multipattern Consensus Regions in Multiple Aligned Protein Sequences and Their Segmentation

Decomposing a biological sequence into its functional regions is an important prerequisite to understanding the molecule. Using multiple alignments of the sequences, we evaluate a segmentation based on the type of statistical variation pattern observed at each aligned site. To describe this more general kind of pattern, we introduce multipattern consensus regions: segmented regions defined by conserved as well as interdependent patterns. The proposed consensus region thus considers patterns that are statistically significant and that extend over a local neighborhood. To show its relevance to protein sequence analysis, we examine p53, a tumor suppressor gene. The results show significant associations between the detected regions and mutation tendency, location on the 3D structure, and heritable cancer factors that can be inferred from human twin studies.
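
The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of the general idea of labeling aligned sites by their variation pattern and grouping consecutive sites into regions. It covers the conserved case alone, using per-column Shannon entropy as a stand-in measure, and does not model the interdependent patterns or the significance testing of the proposed method. The function names `column_entropy` and `segment_by_conservation`, the `threshold` value, and the toy alignment are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one aligned site."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def segment_by_conservation(alignment, threshold=0.5):
    """Group consecutive columns into (start, end, label) runs.

    `alignment` is a non-empty list of equal-length strings. A column is
    labeled "conserved" when its entropy is at or below `threshold`
    (a hypothetical cutoff), otherwise "variable". Indices are 0-based
    and inclusive.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    labels = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        is_conserved = column_entropy(column) <= threshold
        labels.append("conserved" if is_conserved else "variable")

    # Merge neighboring columns that share a label into one region.
    segments = []
    start = 0
    for i in range(1, length + 1):
        if i == length or labels[i] != labels[start]:
            segments.append((start, i - 1, labels[start]))
            start = i
    return segments


# Toy alignment (made up for illustration, not p53 data).
toy_alignment = ["ACDEF", "ACDGH", "ACDKL", "ACDMN"]
regions = segment_by_conservation(toy_alignment)
```

In this toy case the first three columns are identical across sequences and the last two differ in every sequence, so the sketch yields one conserved run followed by one variable run. A faithful implementation would also need a test for statistical dependence between sites, which is omitted here.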
