Integrating Overlapping Structures and Background Information of Words Significantly Improves Biological Sequence Comparison

Word-based models have achieved promising results in sequence comparison. However, how best to exploit two important statistical properties of words in biological sequences, their overlapping structures and their background information, remains an open problem. This paper proposes a new statistical method that integrates the overlapping structures and the background information of words in biological sequences. To assess the effectiveness of this integration for sequence comparison, two sets of evaluation experiments were conducted. The first, assessed by receiver operating characteristic (ROC) curve analysis, applies the proposed method to discriminating between functionally related regulatory sequences and unrelated sequences, as well as between introns and exons. The second uses the F-measure to evaluate the proposed method on clustering hepatitis E virus genotypes. The results demonstrate that integrating the overlapping structures and the background information of words significantly improves biological sequence comparison and that the proposed method outperforms existing models.
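To make the two word properties concrete, the following minimal Python sketch computes overlapping k-word counts, a word's self-overlap indicator, and an expected count under a background model. It is an illustration only and not the statistic proposed in the paper: the function names are hypothetical, and the background is assumed here to be a simple i.i.d. base composition rather than whatever model the authors use.

```python
from collections import Counter
from math import prod


def kword_counts(seq, k):
    """Count every k-word in seq, sliding one position at a time (overlaps included)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overlap_bits(word):
    """Self-overlap indicator: bits[j] is 1 when the suffix of word starting
    at position j equals the prefix of word of the same length."""
    k = len(word)
    return [int(word[j:] == word[:k - j]) for j in range(k)]


def expected_count(word, seq_len, base_freq):
    """Expected number of occurrences of word in a sequence of length seq_len,
    ASSUMING an i.i.d. background with letter probabilities base_freq."""
    positions = seq_len - len(word) + 1
    return positions * prod(base_freq[c] for c in word)


if __name__ == "__main__":
    seq = "ATATA"
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(kword_counts(seq, 2))
    print(overlap_bits("ATA"))
    print(expected_count("AT", len(seq), uniform))
```

A word such as ATA can overlap itself (its last letter matches its first), so its occurrences tend to clump, which changes the variance of its count relative to a non-overlapping word with the same expected count. That is the general reason overlap structure and background expectation are considered together when comparing word counts.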
