Application of canonical correlation analysis for identifying viral integration preferences

MOTIVATION: Gene therapy aims to use viral vectors to attach helpful genetic code to target genes. It is therefore important to develop methods that can discover significant patterns around viral integration sites. Canonical correlation analysis is an unsupervised statistical tool for describing the relations between two related views of the same semantic object, which makes it well suited to identifying such salient patterns.

RESULTS: The proposed method is demonstrated on a sequence dataset obtained from a study on HIV-1 preferred integration regions. The subsequences on the left and right sides of the integration points are given to the method as the two views. Statistically significant relations are found between sequence-driven features derived from these two views, which suggests that the viral preference must be the factor responsible for this correlation. We found significant correlations at x=5, indicating a palindromic behavior surrounding the viral integration site, which agrees with previously reported results.

AVAILABILITY: The developed software tool is available at http://ce.istanbul.edu.tr/bioinformatics/hiv1/.
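The two-view setup described above can be illustrated with a minimal sketch. It is not the authors' software: the mononucleotide-composition features, the synthetic flanks, the noise level, and all function names are assumptions made for illustration. Canonical correlations are computed with a standard QR-plus-SVD formulation, and a shuffled pairing serves as a crude baseline, not a formal significance test.

```python
import random
import numpy as np

BASES = "ACG"  # T is dropped: the four frequencies sum to 1, so one is redundant
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def composition(seq):
    """A, C, G frequencies of a DNA string (hypothetical feature choice)."""
    seq = seq.upper()
    return [seq.count(b) / len(seq) for b in BASES]

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def canonical_correlations(X, Y):
    """Canonical correlations between two views (rows = samples).

    Assumes both centered views have full column rank.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))  # orthonormal basis of view 1
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))  # orthonormal basis of view 2
    # Singular values of Qx^T Qy are the canonical correlations (descending).
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def demo(n=200, length=20, noise=0.1, seed=0):
    """Synthetic flanks: right = noisy reverse complement of left (assumption)."""
    rng = random.Random(seed)
    left = ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]
    right = [
        "".join(rng.choice("ACGT") if rng.random() < noise else b
                for b in reverse_complement(s))
        for s in left
    ]
    control = right[:]
    rng.shuffle(control)  # break the left/right pairing as a baseline
    X = [composition(s) for s in left]
    paired = canonical_correlations(X, [composition(s) for s in right])
    shuffled = canonical_correlations(X, [composition(s) for s in control])
    return paired, shuffled

if __name__ == "__main__":
    paired, shuffled = demo()
    print("paired flanks:  ", np.round(paired, 3))
    print("shuffled flanks:", np.round(shuffled, 3))
```

On the synthetic data, the paired flanks should show a much larger leading canonical correlation than the shuffled ones, which is the kind of contrast the abstract attributes to viral preference. Real integration-site data would need richer sequence-driven features and a proper significance test.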
