Alignment free comparison: k word voting model and its applications.

Alignment free sequence comparison is widely used in sequence analysis, especially in computational biology for large scale similarity comparison. In this paper, we propose a word voting model to compare the biological sequences without alignment. Unlike many comparison methods based on the k word, this model does not use the k word frequency or statistics. Thus there is no limitation on the choice of k. Instead, we used information entropy of gamma distribution to characterize the differences among biological sequences in this model. Finally, we employed the model to do the similarity search and phylogenetic tree construction to further validate the model.

[1]  Qian-zhong Li,et al.  Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition. , 2012, Journal of theoretical biology.

[2]  Robert C. Edgar,et al.  Local homology recognition and distance measures in linear time using compressed amino acid alphabets. , 2004, Nucleic acids research.

[3]  Kuo-Chen Chou,et al.  Predicting membrane protein types by the LLDA algorithm. , 2008, Protein and peptide letters.

[4]  Se-Ran Jun,et al.  Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution , 2009, Proceedings of the National Academy of Sciences.

[5]  Se-Ran Jun,et al.  Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions , 2009, Proceedings of the National Academy of Sciences.

[6]  Wei Chen,et al.  iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties , 2012, PloS one.

[7]  Shao-Ping Shi,et al.  Identifying protein quaternary structural attributes by incorporating physicochemical properties into the general form of Chou's PseAAC via discrete wavelet transform. , 2012, Molecular bioSystems.

[8]  Xin Wang,et al.  Recent progress in predicting protein sub-subcellular locations , 2011, Expert review of proteomics.

[9]  K. Chou,et al.  ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information. , 2008, Biochemical and biophysical research communications.

[10]  Tian-ming Wang,et al.  Vector representations and related matrices of DNA primary sequence based on L-tuple. , 2010, Mathematical biosciences.

[11]  Xin Wang,et al.  PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. , 2012, Analytical biochemistry.

[12]  Antonio Restivo,et al.  Distance measures for biological sequences: Some recent approaches , 2008, Int. J. Approx. Reason..

[13]  Raffaele Giancarlo,et al.  Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment , 2007, BMC Bioinformatics.

[14]  Wei Chen,et al.  iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition , 2013, Nucleic acids research.

[15]  Xin Chen,et al.  An information-based sequence distance and its application to whole mitochondrial genome phylogeny , 2001, Bioinform..

[16]  Kuo-Chen Chou,et al.  Identification of proteases and their types. , 2009, Analytical biochemistry.

[17]  Xingming Sun,et al.  A Novel method for similarity analysis and protein sub-cellular localization prediction , 2010, Bioinform..

[18]  L. Luo,et al.  Genome-based phylogeny of dsDNA viruses by a novel alignment-free method. , 2012, Gene.

[19]  Judith Klein-Seetharaman,et al.  PROTEINS: Structure, Function, and Bioinformatics 58:955–970 (2005) Protein Classification Based on Text Document Classification Techniques , 2022 .

[20]  K. Chou,et al.  REVIEW : Recent advances in developing web-servers for predicting protein attributes , 2009 .

[21]  K. Chou,et al.  Prediction and classification of domain structural classes , 1998, Proteins.

[22]  Yanda Li,et al.  Prediction of C-to-U RNA editing sites in higher plant mitochondria using only nucleotide sequence features. , 2007, Biochemical and biophysical research communications.

[23]  Yanda Li,et al.  SubChlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm. , 2009, Journal of theoretical biology.

[24]  Kuo-Chen Chou,et al.  Prediction of Protein Domain with mRMR Feature Selection and Analysis , 2012, PloS one.

[25]  S. Pääbo,et al.  Conflict Among Individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders , 1998, Journal of Molecular Evolution.

[26]  K. Chou,et al.  Using maximum entropy model to predict protein secondary structure with single sequence. , 2009, Protein and peptide letters.

[27]  K. Chou,et al.  Signal-3L: A 3-layer approach for predicting signal peptides. , 2007, Biochemical and biophysical research communications.

[28]  Tianming Wang,et al.  A simple k-word interval method for phylogenetic analysis of DNA sequences. , 2013, Journal of theoretical biology.

[29]  K. Chou,et al.  Predicting protein-protein interactions from sequences in a hybridization space. , 2006, Journal of proteome research.

[30]  Kuo-Chen Chou,et al.  HIVcleave: a web-server for predicting human immunodeficiency virus protease cleavage sites in proteins. , 2008, Analytical biochemistry.

[31]  Gesine Reinert,et al.  Alignment-Free Sequence Comparison (I): Statistics and Power , 2009, J. Comput. Biol..

[32]  Mark Borodovsky,et al.  Statistical significance in biological sequence analysis , 2006, Briefings Bioinform..

[33]  Gesine Reinert,et al.  Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics , 2010, J. Comput. Biol..

[34]  K. Chou,et al.  Knowledge-based computational intelligence development for predicting protein secondary structures from sequences , 2008, Expert review of proteomics.

[35]  Xiaowei Zhao,et al.  Predicting protein-protein interactions by combing various sequence- derived features into the general form of Chou's Pseudo amino acid composition. , 2012, Protein and peptide letters.

[36]  Kuo-Chen Chou,et al.  GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions. , 2011, Molecular bioSystems.

[37]  Kuo-Chen Chou,et al.  Boosting classifier for predicting protein domain structural class. , 2005, Biochemical and biophysical research communications.

[38]  K. Chou,et al.  Predicting protein fold pattern with functional domain and sequential evolution information. , 2009, Journal of theoretical biology.

[39]  Kuo-Chen Chou,et al.  Insights from investigating the interactions of adamantane-based drugs with the M2 proton channel from the H1N1 swine virus. , 2009, Biochemical and biophysical research communications.

[40]  Xiangde Zhang,et al.  Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word. , 2012, Journal of theoretical biology.

[41]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[42]  Kuo-Chen Chou,et al.  A Multi-Label Classifier for Predicting the Subcellular Localization of Gram-Negative Bacterial Proteins with Both Single and Multiple Sites , 2011, PloS one.

[43]  Kuo-Chen Chou Insights from modeling three-dimensional structures of the human potassium and sodium channels. , 2004, Journal of proteome research.

[44]  Qi Dai,et al.  Using Markov model to improve word normalization algorithm for biological sequence comparison , 2011, Amino Acids.

[45]  K. Chou,et al.  iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins , 2011, PloS one.

[46]  Tianming Wang,et al.  Weighted Relative Entropy for Alignment-free Sequence Comparison Based on Markov Model , 2011, Journal of biomolecular structure & dynamics.

[47]  Lei Zhu,et al.  Using Gaussian model to improve biological sequence comparison , 2010, J. Comput. Chem..

[48]  Kuo-Chen Chou,et al.  MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. , 2007, Biochemical and biophysical research communications.

[49]  Yanping Zhang,et al.  The graphical representation of protein sequences based on the physicochemical properties and its applications , 2010, J. Comput. Chem..

[50]  Tuan D. Pham,et al.  A probabilistic measure for alignment-free sequence comparison , 2004, Bioinform..

[51]  Martin Vingron,et al.  Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts , 2012, Bioinform..

[52]  Kuo-Bin Li,et al.  Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. , 2013, Journal of theoretical biology.

[53]  Xiangde Zhang,et al.  Large Local Analysis of the Unaligned Genome and Its Application , 2013, J. Comput. Biol..

[54]  Yanchun Yang,et al.  Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison , 2008, Bioinform..

[55]  K. Chou,et al.  Analysis and Prediction of the Metabolic Stability of Proteins Based on Their Sequential Features, Subcellular Locations and Interaction Networks , 2010, PloS one.

[56]  K. Chou,et al.  Predicting the quaternary structure attribute of a protein by hybridizing functional domain composition and pseudo amino acid composition , 2009 .

[57]  Tianming Wang,et al.  A novel statistical measure for sequence comparison on the basis of k-word counts. , 2013, Journal of theoretical biology.

[58]  Khalid Sayood,et al.  A new sequence distance measure for phylogenetic tree construction , 2003, Bioinform..

[59]  Yu-Wei Wu,et al.  A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples , 2010, RECOMB.

[60]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[61]  C. Gissi,et al.  Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris. , 2000, Molecular biology and evolution.

[62]  Gesine Reinert,et al.  New powerful statistics for alignment-free sequence comparison under a pattern transfer model. , 2011, Journal of theoretical biology.