Normalized Feature Vectors: A Novel Alignment-Free Sequence Comparison Method Based on the Numbers of Adjacent Amino Acids

Based on all 400 possible types of adjacent amino acids (AAA), we map each protein primary sequence of length L into a 400 by (L-1) matrix M. From this matrix we derive a normalized 400-tuple mathematical descriptor D via singular value decomposition (SVD). The resulting 400-D normalized feature vectors (NFVs) allow quantitative comparison of protein sequences without alignment. Using this normalized representation, we analyze sequence similarity on two data sets: 1) ND5 sequences from nine species and 2) transferrin sequences of 24 vertebrates. We also compare our results with those of related works. Both experiments indicate that the proposed NFV-AAA approach performs well in sequence similarity analysis.
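
The pipeline described above can be illustrated with a minimal Python sketch. The abstract does not specify how the entries of M are filled or how D is read off the SVD, so two choices here are assumptions made only for illustration: column j of M holds the cumulative counts of each adjacent pair seen up to position j, and D is the leading left singular vector of M scaled to unit length. The names `aaa_matrix`, `nfv`, and `distance` are hypothetical, and the Euclidean distance is one possible similarity measure, not necessarily the one used in the paper.

```python
import itertools
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index each of the 20 x 20 = 400 ordered adjacent pairs.
PAIR_INDEX = {a + b: i for i, (a, b) in
              enumerate(itertools.product(AMINO_ACIDS, repeat=2))}

def aaa_matrix(seq):
    """Build a 400 x (L-1) matrix from a protein sequence.

    Assumption: column j stores the running count of every
    adjacent pair observed in positions 0..j.
    """
    seq = seq.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    indicator = np.zeros((400, n_pairs))
    for j in range(n_pairs):
        pair = seq[j:j + 2]
        if pair not in PAIR_INDEX:
            raise ValueError(f"non-standard residue in pair {pair!r}")
        indicator[PAIR_INDEX[pair], j] = 1.0
    return np.cumsum(indicator, axis=1)

def nfv(seq):
    """Return a unit-length 400-D descriptor (assumed: leading left singular vector)."""
    u, _, _ = np.linalg.svd(aaa_matrix(seq), full_matrices=False)
    d = np.abs(u[:, 0])  # remove the sign ambiguity of the SVD
    return d / np.linalg.norm(d)

def distance(seq_a, seq_b):
    """Euclidean distance between two descriptors (illustrative choice)."""
    return float(np.linalg.norm(nfv(seq_a) - nfv(seq_b)))
```

Because every descriptor has the same length regardless of L, sequences of different lengths can be compared directly, for example with `distance("ACDEFGHIK", "ACDEFGHIKLMN")`.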
