Study on the Relation between Virus and Host Cell by Alignment-Free Sequence Comparison

With the development of the human genome project, scientists have obtained a large number of biological sequences that need to be processed and analyzed. In this paper, we characterized the biological relationship between viruses and their host cells using the alignment-free sequence comparison statistics D2S, D2*, Hao, Eu and Ch. Alignment-free comparison is based on the distribution of K-tuples in the sequences of the tested species, where a K-tuple is a short subsequence of length K. We used ROC curves to analyze DNA sequence data from viruses and host cells in order to identify the best-performing statistic. The ROC curves show that when K is small, all of the statistics reflect the genetic similarity between virus and host cell well. As K increases, D2S and D2* remain almost unchanged and continue to reflect this relationship well, whereas the performance of the other statistics declines. This confirms our conjecture, and we therefore consider D2S and D2* to be effective alignment-free sequence comparison methods for studying the relation between virus and host cell.
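To make the idea of K-tuple distributions concrete, the following is a minimal Python sketch, not the implementation used in this study. It counts overlapping K-tuples in a DNA sequence and computes three simple quantities from them: the uncentered D2 statistic (the inner product of two K-tuple count vectors, from which D2S and D2* are derived by centering and normalizing against a background model), and two distances on K-tuple frequency vectors. Reading Eu as the Euclidean distance and Ch as the Chebyshev distance is an assumption here, since the abstract does not define them, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def ktuple_counts(seq, k):
    """Count overlapping K-tuples, skipping any window with a non-ACGT symbol."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(base in ALPHABET for base in word):
            counts[word] += 1
    return counts


def ktuple_freqs(seq, k):
    """Relative frequency of every possible K-tuple (4**k entries)."""
    counts = ktuple_counts(seq, k)
    total = sum(counts.values())
    words = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {w: (counts[w] / total if total else 0.0) for w in words}


def d2(seq_a, seq_b, k):
    """Uncentered D2: sum over K-tuples of the product of the two counts."""
    ca, cb = ktuple_counts(seq_a, k), ktuple_counts(seq_b, k)
    return sum(ca[w] * cb[w] for w in ca)


def eu_distance(seq_a, seq_b, k):
    """Euclidean distance between K-tuple frequency vectors (assumed 'Eu')."""
    fa, fb = ktuple_freqs(seq_a, k), ktuple_freqs(seq_b, k)
    return math.sqrt(sum((fa[w] - fb[w]) ** 2 for w in fa))


def ch_distance(seq_a, seq_b, k):
    """Chebyshev distance between K-tuple frequency vectors (assumed 'Ch')."""
    fa, fb = ktuple_freqs(seq_a, k), ktuple_freqs(seq_b, k)
    return max(abs(fa[w] - fb[w]) for w in fa)
```

In a setting like the one described above, such a function would be evaluated for each virus and candidate host pair over a range of K, and the resulting scores would then be passed to a ROC analysis. D2S and D2* additionally require expected K-tuple counts under a background sequence model, which this sketch deliberately leaves out.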
