Reducing the effect of the data order in algorithms for constructing phylogenetic trees

A major objective of biological systematics is to infer phylogenies (evolutionary trees) from available data on the species under investigation. This involves searching for an unknown tree from data generated by a stochastic process that operates along the tree. These evolutionary trees are binary trees in which the tips are occupied by the species or OTUs (Operational Taxonomic Units), in our case represented by DNA or RNA sequences, and the internal nodes joining them up to the root are occupied by their hypothetical ancestors or HTUs (Hypothetical Taxonomic Units). Since the number of possible topologies for such a tree is too great to examine them all in a reasonable time, methods to infer tree topologies are only approximate. A common problem with these algorithms is their dependence on data order. The program described here implements a way to reduce this dependence in the Camin-Sokal parsimony method, although it may also be used with other methods. The program was written in Pascal using a Turbo Pascal compiler (version 3.01A) from Borland International, to run on an IBM PC or true compatible. Lists of pointers, rather than arrays, were used to store the sequences. The array is the more commonly used data structure, but in some cases it wastes computer memory and in others it is too short for the sequences. The sites in a sequence are represented by Pascal sets, which can take the following values: {A}, {G}, {C} and {U or T}, or their set unions, for instance {AG}. The latter is the case when the sites belong to an HTU that joins two sequences in a tree structure. The Camin-Sokal algorithm is recursive, and it consists of (i) adding a new branch to the tree in the best position found after examining the tree from the root to the tips, and (ii) if possible, rearranging the tree until a better order is found.
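The set-based representation of sites can be illustrated with a minimal sketch. It is written in Python rather than the Pascal of the original program, and the names `encode` and `join`, the folding of T into U, and the plain site-wise union used to build an HTU are assumptions made for illustration only; the text does not give the exact rule the program uses.

```python
def encode(sequence):
    """Turn a string such as 'AGCU' into a list of single-base sets (OTU sites)."""
    allowed = set("AGCU")
    sites = []
    for base in sequence.upper():
        # Assumption: T and U are treated as the same value, as in {U or T}.
        if base == "T":
            base = "U"
        if base not in allowed:
            raise ValueError(f"unexpected base: {base!r}")
        sites.append(frozenset(base))
    return sites


def join(left, right):
    """Site-wise union of two encoded sequences (a hypothetical HTU)."""
    if len(left) != len(right):
        raise ValueError("sequences must have the same length")
    return [a | b for a, b in zip(left, right)]


htu = join(encode("AGC"), encode("GGC"))
# The first site of the HTU holds both A and G; the others hold a single base.
```

A site of an OTU always holds exactly one base, while a site of an HTU may hold several, which is what the set union expresses.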
If we add the branches following Wagner's method, instead of sequentially, we can considerably reduce the effect of the data order on the tree topology. This method consists of building an initial tree from the two sequences with the minimum Hamming distance, i.e. the pair for which the number of sites at which the two sequences differ is minimal, and then adding new sequences (branches), each chosen by its minimum Hamming distance to the root:

    Do the first tree
    Repeat
        Choose the OTU with minimum Hamming distance to the root
        Find a suitable place for it
        Insert the OTU in this place
        Try to rearrange the tree
    until (number of OTUs - 2) times
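The selection order described above can be sketched as follows. This is a rough Python illustration, not the original Pascal program: the function names are hypothetical, ties are broken alphabetically by sequence name (an assumption), and the root is approximated by the site-wise union of the sequences already in the tree, a stand-in for the real root HTU that the tree-building and rearrangement steps would produce.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_pair(seqs):
    """Names of the two sequences with minimum Hamming distance."""
    return min(combinations(sorted(seqs), 2),
               key=lambda pair: hamming(seqs[pair[0]], seqs[pair[1]]))


def distance_to_root(seq, root):
    """Count the sites whose base is absent from the root's set at that site."""
    return sum(base not in site for base, site in zip(seq, root))


def addition_order(seqs):
    """Order in which OTUs would be added to the tree (sketch only)."""
    first, second = closest_pair(seqs)
    order = [first, second]
    # Assumption: stand-in root = site-wise union of the sequences in the tree.
    root = [{x, y} for x, y in zip(seqs[first], seqs[second])]
    remaining = set(seqs) - {first, second}
    while remaining:  # runs (number of OTUs - 2) times
        nxt = min(sorted(remaining),
                  key=lambda name: distance_to_root(seqs[name], root))
        order.append(nxt)
        remaining.remove(nxt)
        root = [site | {base} for site, base in zip(root, seqs[nxt])]
    return order


seqs = {"a": "AAGC", "b": "AAGU", "c": "GGGU", "d": "AGGU"}
print(addition_order(seqs))  # ['a', 'b', 'd', 'c']
```

Because each choice depends only on distances and not on the position of a sequence in the input, listing the same sequences in a different order gives the same addition order in this sketch.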