Inferring Complex Phylogenies Using Parsimony: An Empirical Approach Using Three Large DNA Data Sets for Angiosperms

1998. Soltis et al.: Large DNA Data Sets.

Abstract. To explore the feasibility of parsimony analysis for large data sets, we conducted heuristic parsimony searches and bootstrap analyses on separate and combined DNA data sets for 190 angiosperms and three outgroups. Separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp) sequences were combined into a single matrix 4,733 bp in length. Analyses of the combined data set show great improvements in computer run times compared to those of the separate data sets and of the data sets combined in pairs. Six searches of the 18S rDNA + rbcL + atpB data set were conducted; in all cases TBR branch swapping was completed, generally within a few days. In contrast, TBR branch swapping was not completed for any of the three separate data sets, or for the pairwise combined data sets. These results illustrate that it is possible to conduct a thorough search of tree space with large data sets, given sufficient signal. In this case, and probably most others, sufficient signal for a large number of taxa can only be obtained by combining data sets. The combined data sets also have higher internal support for clades than the separate data sets, and more clades receive bootstrap support of ≥ 50% in the combined analysis than in analyses of the separate data sets. These data suggest that one solution to the computational and analytical dilemmas posed by large data sets is the addition of nucleotides, as well as taxa. [Large data sets, parsimony, phylogeny.]

Phylogenetic relationships in many large groups of organisms remain enigmatic despite intensive study. Elucidating relationships in these groups will ultimately require the compilation and phylogenetic analysis of sequences and/or morphological traits representing hundreds of taxa. The feasibility of phylogenetic analysis of such large data sets has been debated, however (e.g., Patterson et al., 1993; Hillis et al., 1994; Hillis, 1995). For example, Hillis et al. (1994) suggested that in some instances correct phylogeny reconstruction for only four taxa would require over 10,000 bp of DNA sequence. This degree of complexity implied much greater difficulty with larger data sets and stimulated some to propose that phylogenetic problems be broken into a series of smaller problems (e.g., Mishler, 1994; Kim, 1996; Soltis and Soltis, 1996; Rice et al., 1997), one extreme being a large number of four-taxon questions (e.g., Graur et al., 1996).

Large data sets also pose problems for parsimony analyses because of the large number of trees that must be examined in searching for the shortest tree(s). The number of potential solutions grows faster than exponentially as taxa are added (Felsenstein, 1978). For example, for 20 taxa there are approximately 8.87 × 10^23 possible rooted trees (Felsenstein, 1978); for 228 taxa (the number of species recently analyzed by Soltis et al., 1997b, in a phylogenetic analysis of angiosperms using nuclear 18S ribosomal DNA (rDNA) sequences), there are approximately 1.2 × 10^502 solutions (Hillis, 1996).

Despite the dire predictions suggested for some four-taxon analyses (Hillis et al., 1994) and the number of possible trees for large data sets (Felsenstein, 1978), several phylogenetic analyses involving hundreds of species have been conducted for angiosperms, and the results of these studies have important implications for the analysis of large data sets. Analyses of three large DNA data sets (each with over 200 species) have been conducted, using the plastid genes rbcL (Chase et al., 1993) and atpB (Savolainen et al., 1996) and the nuclear 18S rDNA (Soltis et al., 1997b). All three analyses have yielded highly similar topologies for the angiosperms (reviewed by Chase and Cox, 1997; Soltis et al., 1997a; Chase and Albert, 1998). A lengthier analysis of the 499-taxon rbcL data set of Chase et al. (1993) found shorter trees (Rice et al., 1997), but the general picture of angiosperm relationships remains unchanged. Significantly, none of the searches in any of these analyses swapped to completion, despite huge investments of computer time: Soltis et al. (1997b) employed over 2 years of computer time on the 18S rDNA analysis, and Rice et al. (1997) devoted a total of "approximately 11.6 months of CPU time" using three Sun workstations in the reanalysis of the rbcL data set. The three gene trees (representing the plastid and nuclear genomes) are highly similar in the relationships they depict for all major groups of angiosperms, suggesting that even these rough estimates of phylogeny based on the individual data sets provide a consistent picture of organismal relationships. Hence, these analyses indicate that the phylogenetic analysis of large data sets may be more tractable than suggested by earlier simulation studies.

Hillis (1996) recently tested the feasibility of analyzing large data sets by simulating the 18S rDNA phylogeny for angiosperms based on 228 sequences (Soltis et al., 1997b). His simulations suggested that the model phylogeny can be reconstructed using either parsimony or neighbor-joining methods with > 99% accuracy with only 5,000 bp of sequence data. Initial empirical work (Soltis et al., 1997a) supported these conclusions. Separate and combined data sets for 232 species of angiosperms for which both rbcL and 18S rDNA sequences were available (over 3,000 bp of sequence data) were analyzed using parsimony methods. Analysis of the combined data set swapped to completion in a few days; after 1 month, neither of the separate data sets had swapped to completion. In addition, combining the data sets greatly increased the internal support for many clades (as measured by parsimony jackknife values; Farris et al., 1996).
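The scale of tree space mentioned in the introduction can be illustrated with a short calculation. The sketch below assumes strictly bifurcating trees with labeled tips, for which the standard counts are (2n - 3)!! rooted and (2n - 5)!! unrooted topologies for n taxa. The function names are illustrative only, and counts made under other conventions (for example, counts that admit multifurcations) will differ from these values.

```python
def odd_double_factorial(k: int) -> int:
    """Return k * (k - 2) * ... * 3 * 1 for odd k; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_bifurcating_trees(n_taxa: int) -> int:
    """Number of rooted, strictly bifurcating topologies: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return odd_double_factorial(2 * n_taxa - 3)


def unrooted_bifurcating_trees(n_taxa: int) -> int:
    """Number of unrooted, strictly bifurcating topologies: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return odd_double_factorial(2 * n_taxa - 5)


def order_of_magnitude(value: int) -> int:
    """Base-10 exponent of a positive integer, computed without floats."""
    return len(str(value)) - 1


if __name__ == "__main__":
    for n in (4, 10, 20, 228):
        exponent = order_of_magnitude(unrooted_bifurcating_trees(n))
        print(f"{n} taxa: about 10^{exponent} unrooted bifurcating trees")
```

Python integers are arbitrary precision, so the 228-taxon count is computed exactly; it is on the order of 10^502, which is why exhaustive evaluation is out of the question and heuristic searches are used instead.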
Analyses of the rbcL + 18S rDNA data set generally result in trees having greater overall resolution than those inferred from the separate data sets and a combination of the well-supported clades present in the separate rbcL and 18S rDNA trees. Analysis of the combined data set also recovered several "uniquely supported" clades that received jackknife support of ≥ 50% in the analyses of the combined, but not the separate, data sets. These results are comparable to those observed with combined data sets in more focused studies involving far fewer taxa (e.g., Olmstead and Sweere, 1994; Soltis et al., 1996; Sullivan, 1996).

To explore further the feasibility of analysis of large data sets using parsimony, we conducted searches on a combined 18S rDNA + rbcL + atpB data set for 193 species. We also examined the effects of combining large data sets on internal support by conducting bootstrap analyses on the separate and combined data sets.

We first constructed separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp) sequences for 193 taxa for which all three sequences were available; these three data sets were then combined into a single data matrix 4,733 bp in length. Represented in this matrix are 190 angiosperms from approximately 148 families that represent well the diversity of angiosperms; also included are three outgroups, Ephedra sinica, Ginkgo biloba, and Pseudotsuga menziesii (for atpB, Pinus was used in place of Pseudotsuga). Of the 190 angiosperms included, we used 18S rDNA, rbcL, and atpB sequences for the same genus (and species, if possible) for 137 taxa; in 52 instances, different genera were used as placeholders for a family. In one instance, different families of a sister pair were used: Cyperus (Cyperaceae) was used for 18S rDNA and rbcL, whereas Juncus (Juncaceae) was used for atpB. Most of the 18S rDNA and rbcL sequences are from Soltis et al. (1997b) and Chase et al. (1993), respectively; other sources of published sequences include Hoot and Crane (1995), Hoot et al. (1995), and Soltis and Soltis (1997). These were supplemented with additional unpublished sequences.

Systematic Biology, Vol. 47.

We conducted parsimony searches on the combined and the three separate data sets using PAUP* 4.0 (Swofford, 1997) on Power Macintosh computers. All parsimony searches were conducted as follows. First, 500 replicate heuristic searches with RANDOM taxon addition and NNI branch swapping were conducted, saving five trees per replicate. Using the shortest trees obtained from these initial searches as starting trees, we then conducted subsequent searches using TBR branch swapping and saving all most parsimonious trees. For the three separate data sets, these subsequent TBR searches were allowed to proceed until the search "stalled" on a tree length for 4 days or more and the number of trees in memory exceeded 3,500. At this point, we selected a new group of five starting trees one step longer than those used initially, and TBR searches were conducted as before. If the initial NNI searches did not produce trees one step longer, we used the next longest trees. This process was repeated three times for each of the three separate data sets.

We almost certainly did not find the shortest trees via this approach. The goal of this study was not to ascertain phylogenetic relationships per se, but rather to compare the performance of separate versus combined analyses; thus, trees from the separate parsimony analyses are not presented. However, the shortest trees obtained agree closely with those presented elsewhere (see reviews by Soltis et al., 1997a; Chase and Albert, 1998).

For the combined 18S rDNA + rbcL + atpB data set, a similar approach was used: After each TBR search, a new group of five starting trees that were one step longer than those used in the previous search was selected for further analysis.
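The restart schedule described for the TBR searches can be sketched in a few lines of Python. This is only an illustration of the selection and stopping rules as stated in the text, not the authors' actual procedure or a PAUP* interface: the tree identifiers, the `nni_pool` mapping, and all function names are hypothetical.

```python
from typing import Dict, List, Optional


def next_starting_length(nni_pool: Dict[str, int],
                         previous_length: Optional[int]) -> Optional[int]:
    """Pick the tree length for the next group of starting trees.

    First round: the shortest length in the pool. Later rounds: the smallest
    length greater than the previous one, which is "one step longer" when
    such trees exist and "the next longest" otherwise.
    """
    lengths = sorted(set(nni_pool.values()))
    if previous_length is None:
        return lengths[0] if lengths else None
    longer = [length for length in lengths if length > previous_length]
    return longer[0] if longer else None


def pick_starting_trees(nni_pool: Dict[str, int],
                        previous_length: Optional[int],
                        group_size: int = 5) -> List[str]:
    """Return up to `group_size` tree identifiers of the chosen length."""
    target = next_starting_length(nni_pool, previous_length)
    if target is None:
        return []
    candidates = sorted(t for t, length in nni_pool.items() if length == target)
    return candidates[:group_size]


def search_has_stalled(days_at_same_length: float, trees_in_memory: int) -> bool:
    """Stopping rule used for the separate data sets: stalled on one tree
    length for 4 days or more AND more than 3,500 trees in memory."""
    return days_at_same_length >= 4 and trees_in_memory > 3500


if __name__ == "__main__":
    # Hypothetical pool of NNI trees: identifier -> tree length in steps.
    pool = {"t1": 100, "t2": 100, "t3": 101, "t4": 103}
    previous = None
    for round_number in range(1, 4):
        group = pick_starting_trees(pool, previous)
        previous = next_starting_length(pool, previous)
        print(round_number, previous, group)
```

Choosing the smallest available length above the previous one covers both rules with a single comparison, since a length exactly one step longer is always the smallest longer length whenever it is present.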
Where the initial NNI searches did not produce trees one step longer than those just used, the next longest NNI trees were selected as starting trees. This process was repeated six times for the combined 18S rDNA + rbcL + atpB data set. Issues of congruence and whether or

[1] J. Huelsenbeck et al. Hobgoblin of phylogenetics? Nature, 1994.

[2] Joseph Felsenstein et al. The number of evolutionary trees. 1978.

[3] Richard G. Olmstead et al. Combining Data in Phylogenetic Systematics: An Empirical Approach Using Three Molecular Data Sets in the Solanaceae. 1994.

[4] Junhyong Kim et al. General Inconsistency Conditions for Maximum Parsimony: Effects of Branch Lengths and Increasing Numbers of Taxa. 1996.

[5] David M. Williams et al. Congruence Between Molecular and Morphological Phylogenies. 1993.

[6] Douglas E. Soltis et al. Phylogenetic Inference in Saxifragaceae Sensu Stricto and Gilia (Polemoniaceae) Using matK Sequences. 1995.

[7] D. Hillis. Inferring complex phylogenies. Nature, 1996.

[8] Junhyong Kim et al. Separate Versus Combined Analysis of Phylogenetic Evidence. 1995.

[9] D. Soltis et al. A Comparison of Angiosperm Phylogenies from Nuclear 18S rDNA and rbcL Sequences. 1995.

[10] M. Donoghue et al. Analyzing large data sets: rbcL 500 revisited. Systematic Biology, 1997.

[11] D. Soltis et al. matK and rbcL Gene Sequence Data Indicate That Saxifraga (Saxifragaceae) Is Polyphyletic. 1996.

[12] S. B. Hoot et al. Inter-familial relationships in the Ranunculidae based on molecular systematics. 1995.

[13] M. Chase et al. A Perspective on the Contribution of Plastid rbcL DNA Sequences to Angiosperm Phylogenetics. 1998.

[14] Phylogenetic relationships of the Lardizabalaceae and Sargentodoxaceae: Chloroplast and nuclear DNA sequence evidence. 1995.