Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets
Ordination is a powerful method for analysing complex data sets but has been largely ignored in sequence analysis. This paper shows how to use principal coordinates analysis to find low-dimensional representations of distance matrices derived from aligned sets of sequences. The method takes a matrix of Euclidean distances between all pairs of sequences and finds a coordinate space in which those distances are exactly preserved. The main problem is to find a measure of distance between aligned sequences that is Euclidean. The simplest distance function is the square root of the percentage difference (as measured by identities) between two sequences, ignoring any alignment position where any sequence has a gap. If positions with gaps are not ignored, the distances cannot be guaranteed to be Euclidean, but the deleterious effects are trivial. Two examples of the method are shown. In the first, a set of 226 aligned globins was analysed, and the resulting ordination very successfully represents the known patterns of relationship between the sequences. In the second, a set of 610 aligned 5S rRNA sequences was analysed. Sequence ordinations complement phylogenetic analyses; they should not be viewed as a complete alternative.
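The procedure described in the abstract can be illustrated with a minimal sketch in Python with NumPy. This is not the paper's own implementation: the function names `sqrt_mismatch_distances` and `pcoa`, the toy alignment, and the eigenvalue tolerance are all assumptions made for illustration. The sketch uses the fraction of differing positions rather than the percentage; the two differ only by a constant scale factor, which does not affect whether the distances are Euclidean.

```python
import numpy as np

def sqrt_mismatch_distances(seqs, gap="-"):
    """Square root of the fraction of differing positions between aligned
    sequences, using only columns with no gap in any sequence."""
    arr = np.array([list(s) for s in seqs])
    keep = ~(arr == gap).any(axis=0)      # drop columns containing a gap
    arr = arr[:, keep]
    # Fraction of mismatching positions for every pair of sequences.
    mismatch = (arr[:, None, :] != arr[None, :, :]).mean(axis=2)
    return np.sqrt(mismatch)

def pcoa(dist, tol=1e-10):
    """Principal coordinates analysis (classical scaling) of a distance matrix.
    Returns the coordinates on axes with positive eigenvalues, and all
    eigenvalues sorted in decreasing order."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain an inner-product matrix.
    gram = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals > tol              # tol is an assumed numerical cutoff
    coords = eigvecs[:, positive] * np.sqrt(eigvals[positive])
    return coords, eigvals

# Hypothetical toy alignment; column 7 contains a gap and is ignored.
toy_alignment = ["ACGTAC-T", "ACGTTCGT", "TCGAACGT", "TCGATCGA"]
dist = sqrt_mismatch_distances(toy_alignment)
coords, eigvals = pcoa(dist)

# Distances recomputed from the coordinates, for comparison with `dist`.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

Keeping only the first two or three columns of `coords` gives the low-dimensional representation used for plotting. When the input distances are Euclidean, no eigenvalue is negative beyond rounding error, and `recovered` matches `dist` when all positive axes are retained.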