TY - JOUR
AU1 - Baroni, Mihaela
AU2 - Semple, Charles
AU3 - Steel, Mike
AB - We describe some new and recent results that allow for the analysis and representation of reticulate evolution by nontree networks. In particular, we (1) present a simple result to show that, despite the presence of reticulation, there is always a well-defined underlying tree that corresponds to those parts of life that do not have a history of reticulation; (2) describe and apply new theory for determining the smallest number of hybridization events required to explain conflicting gene trees; and (3) present a new algorithm to determine whether an arbitrary rooted network can be realized by contemporaneous reticulation events. We illustrate these results with examples.

Keywords: Directed acyclic graph, reticulate evolution, hybrid species, subtree prune and regraft

Evolutionary relationships are generally represented by nonreticulating trees, and for certain groups of taxa (e.g., mammals) this model seems well suited. However, for other groups (for example, plants, some fish, and bacteria), processes of reticulate evolution such as the formation of hybrid species, horizontal gene transfer, and other mechanisms (for example, endosymbiosis) suggest that evolutionary history would be better described by a network that is more complex than a tree, with some species arising from the genetic contribution of two (rather than one) ancestral lineages. Although processes of reticulate evolution have long been recognized in biology, techniques for representing and analyzing reticulate evolution have tended to be fairly ad hoc. For example, one might first build a tree and then heuristically add some additional edges if these improve the fit of the data (as in Legendre and Makarenkov, 2002).
In the last few years there has been much new theoretical work by computer scientists and mathematicians (e.g., Baroni, 2004; Baroni et al., 2004; Gusfield, 2004; Gusfield et al., 2004; Holland et al., 2004; Huson et al., 2004, 2005; Moret et al., 2004; Song and Hein, 2004) with the aim of providing more rigorous approaches to the representation and analysis of reticulate evolution. In the third and fourth sections, we provide a brief overview of some of our recent work and show how it can be applied to set lower bounds on the degree of reticulation required to explain two conflicting phylogenetic trees. We illustrate the application of these results on two trees that describe the evolution of alpine Ranunculi in New Zealand. In the fifth section, we present a fast algorithm that determines whether or not a hybrid phylogeny can be realized by hybridization events between species that existed at the same time. This is an obvious biological requirement, though one that is often overlooked in a formal mathematical representation. The last section contains some concluding remarks.

Hybrid Phylogenies

In this section we introduce some terminology that is useful for describing and studying hybrid evolution. Informally, a “hybrid phylogeny” is simply a rooted network in which each arc (directed edge) leads from an ancestral taxon to its immediate descendants. However, unlike a rooted phylogenetic tree, we allow for some (ancestral or extant) taxa to have two (or more) incoming arcs. In other words, we regard those taxa as being hybrids, consisting of a genetic composition from both (or all) of the incoming arcs. In this section, we formalize these notions in order to obtain precise results. Furthermore, we describe a tree that underlies any hybrid phylogeny, and provide some background and motivation for the rest of the article. Throughout, the notation and terminology mostly follow Baroni (2004) and Baroni et al. (2004). First we recall some graph-theoretic terminology.
Directed graphs (also known as digraphs) are used in evolutionary biology to represent the evolutionary history of extant species. Usually, this representation takes the form of a rooted phylogenetic tree. However, in this article we are mostly interested in representations called (rooted) hybrid phylogenies. A directed graph consists of a collection of nodes and a collection of directed edges called arcs, with each arc joining two nodes. Nodes typically represent species, individuals, or DNA sequences, whereas arcs represent relationships of ancestry. Thus, if u is the “parent” of v, then we denote this relationship with the arc (u,v). The first node indicates where the arc is coming from and the second node indicates where the arc is going to; thus (u,v) ≠ (v,u).

The degree of a node v is the number of arcs incident with v. In directed graphs, we often distinguish between arcs coming out of a node and those coming into a node. In particular, the outdegree of v is the number of arcs whose first component is v and is denoted d+(v). The indegree of v is the number of arcs whose second component is v and is denoted d−(v). In rooted phylogenies and hybrid phylogenies, the outdegree of a node v is the number of “children” of v, whereas the indegree of v is the number of “parents” of v.

A directed path in a digraph is an alternating sequence v0, a1, v1, a2, …, ak, vk of nodes and arcs in which ai is an arc from vi−1 to vi for all i, and no node or arc appears more than once. Essentially, a path describes one way in which we can get from one node to another following the direction of the arcs. A directed cycle of a digraph is such a sequence in which the first and last nodes are equal (and no other node is repeated). A digraph is acyclic if it has no directed cycles. An acyclic digraph D with no underlying parallel edges (that is, no pair of arcs joining the same two nodes) is rooted if there is a distinguished node ρ, called the root, with the properties that d−(ρ) = 0 and there is a directed path from ρ to every node of D.
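The degree and root definitions above can be made concrete with a small sketch. The following Python fragment is an illustration only: the arc-list representation, the function names, and the six-node example network are assumptions made here and do not come from the article. It checks only the two root properties (indegree zero and reachability); acyclicity and the absence of parallel arcs are taken for granted.

```python
from collections import defaultdict

def degrees(arcs):
    """Return (indegree, outdegree) maps for a list of arcs (u, v)."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in arcs:
        outdeg[u] += 1   # arc leaves u: counts toward d+(u)
        indeg[v] += 1    # arc enters v: counts toward d-(v)
    return indeg, outdeg

def has_root_properties(nodes, arcs, root):
    """Check d-(root) = 0 and that a directed path leads from root to every node."""
    indeg, _ = degrees(arcs)
    if indeg[root] != 0:
        return False
    children = defaultdict(list)
    for u, v in arcs:
        children[u].append(v)
    seen, stack = {root}, [root]
    while stack:                      # depth-first search along arc directions
        for w in children[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == set(nodes)

# Hypothetical example: b has two parents, u and v.
example_nodes = ["r", "u", "v", "a", "b", "c"]
example_arcs = [("r", "u"), ("r", "v"), ("u", "a"),
                ("u", "b"), ("v", "b"), ("v", "c")]
indeg, outdeg = degrees(example_arcs)
```

In this example, indeg["b"] is 2 (two parents) and outdeg["r"] is 2 (two children), and only the node r satisfies both root properties.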
If D is a rooted digraph, then a rooted subtree of D is any rooted tree that is obtained from D by deleting nodes (and any arcs incident with these nodes) and arcs. We now formally describe rooted phylogenetic trees and hybrid phylogenies. Throughout these definitions, and indeed throughout this article, X will always denote a set of extant species.

A rooted phylogenetic tree 𝒯 on X is a rooted tree with no nodes that have both indegree one and outdegree one, whose leaf set is X, and whose root has outdegree at least two. In addition, 𝒯 is binary or fully resolved if all interior nodes have outdegree two. We sometimes refer to X as the label set of 𝒯 and denote it as ℒ(𝒯). Indeed, for a collection 𝒫 of rooted phylogenetic trees, we denote the union of the label sets of the trees in 𝒫 by ℒ(𝒫). Two rooted binary phylogenetic trees 𝒯1 and 𝒯2 are shown in Figure 1.

Figure 1. A hybrid phylogeny ℋ, and two rooted phylogenetic trees 𝒯1 and 𝒯2 displayed by ℋ.

A hybrid phylogeny ℋ on X is a rooted acyclic digraph in which X is the set of nodes of outdegree zero, the root has outdegree at least two, and for all nodes v with d+(v) = 1, we have d−(v) ≥ 2. Nodes of indegree at least two (called hybridization nodes) represent hybridization events. These correspond to an exchange of genetic information between hypothetical ancestors by processes such as horizontal gene transfer, gene fusion, etc. To illustrate, a hybrid phylogeny ℋ on X = {a,b,c,d,e} is shown in Figure 1, where the root is the topmost node. The node * as well as the node labeled b are hybridization nodes. Here and in all other figures, it is implicit that arcs are directed downwards. Observe that a rooted phylogenetic tree on X is a particular type of hybrid phylogeny (one that contains no hybridization nodes).
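As a rough illustration of the hybrid phylogeny definition, the sketch below tests the three degree conditions and lists the hybridization nodes. It assumes the input is already a rooted acyclic digraph given as an arc list; the function names and both example networks are hypothetical and are not taken from the figures.

```python
from collections import Counter

def hybridization_nodes(arcs):
    """Nodes of indegree at least two."""
    indeg = Counter(v for _, v in arcs)
    return {v for v, d in indeg.items() if d >= 2}

def satisfies_degree_conditions(nodes, arcs, root, X):
    """Check the degree conditions of a hybrid phylogeny on X.

    Assumes (does not verify) that the digraph is rooted and acyclic.
    """
    indeg = Counter(v for _, v in arcs)
    outdeg = Counter(u for u, _ in arcs)
    leaves = {v for v in nodes if outdeg[v] == 0}
    if leaves != set(X):              # X must be exactly the outdegree-zero nodes
        return False
    if outdeg[root] < 2:              # root needs at least two children
        return False
    # every node with exactly one child must have at least two parents
    return all(indeg[v] >= 2 for v in nodes if outdeg[v] == 1)

# Hypothetical network with one hybrid leaf b.
net_nodes = ["r", "u", "v", "a", "b", "c"]
net_arcs = [("r", "u"), ("r", "v"), ("u", "a"),
            ("u", "b"), ("v", "b"), ("v", "c")]

# Hypothetical counterexample: u has one parent and one child.
bad_nodes = ["r", "u", "a", "b"]
bad_arcs = [("r", "u"), ("u", "a"), ("r", "b")]
```

The second network fails because u has indegree one and outdegree one, which the definition rules out.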
Let 𝒯 be a rooted phylogenetic tree on X and let ℋ be a hybrid phylogeny on X′, where X ⊆ X′. Then ℋ displays 𝒯 if 𝒯 can be obtained from ℋ by deleting nodes and edges, and by replacing nodes of indegree one and outdegree one and their incident edges with a single edge (that is, suppressing nodes of indegree one and outdegree one). Extending this to a collection 𝒫 of rooted phylogenetic trees, we say that ℋ displays 𝒫 if ℋ displays every tree in 𝒫. For example, in Figure 1, the hybrid phylogeny ℋ displays both 𝒯1 and 𝒯2. Biologically speaking, saying that ℋ displays 𝒯 means that a gene tree with the topology described by 𝒯 could arise from an evolutionary history depicted by ℋ without requiring the action of other processes such as lineage sorting. The concept of display can be generalized to allow refinement of nonbinary trees; however, we do not require this in this article.

An Underlying Tree for a Hybrid Phylogeny

Processes of reticulate evolution such as the evolution of hybrid species seem to call into question the very existence of any meaningful concept of a tree of life. However, we now describe a simple mathematical result that formalizes how there is always an underlying tree corresponding to those parts of life that do not have a history of reticulation. This result is similar in spirit (though different in detail) to results by Bafna and Bansal (2004), Gusfield (2004), and Huson et al. (2005).

Let ℋ = (V,E) be a hybrid phylogeny on X with root node ρ. Let VC be the set of nodes of ℋ that lie on at least one undirected cycle (that is, a cycle that arises by ignoring the orientation of the arcs). Let VT = (V−VC) ∪ {ρ} ∪ X. For a node v of V, let c(v) denote the set of species x in X for which there is a directed path from v to x (i.e., c(v) is the set of extant species for which v is an ancestor, often referred to as a cluster or a clade). To illustrate these concepts, consider the hybrid phylogeny shown in Figure 2a. Here the nodes in VT are solid. Furthermore, c(u) = {a,b,c} and c(z) = {d,e}.
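The cluster map c(v) is easy to compute by following arcs downwards. The sketch below is a minimal illustration under assumed conventions (an arc list, a recursive helper, and a small hypothetical network that is not the one in Figure 2a); a species x is counted as its own descendant, so c(x) = {x}.

```python
from collections import defaultdict

def clusters(arcs, X):
    """Map every node v to c(v), the set of species in X reachable from v."""
    children = defaultdict(list)
    for u, v in arcs:
        children[u].append(v)
    memo = {}

    def c(v):
        if v not in memo:
            found = {v} if v in X else set()   # a species is its own descendant
            for w in children[v]:
                found |= c(w)                  # union over all children
            memo[v] = frozenset(found)
        return memo[v]

    nodes = {n for arc in arcs for n in arc}
    return {v: c(v) for v in nodes}

# Hypothetical network: b is a hybrid of u and v.
cl_arcs = [("r", "u"), ("r", "v"), ("u", "a"),
           ("u", "b"), ("v", "b"), ("v", "c")]
cl = clusters(cl_arcs, {"a", "b", "c"})
```

In this example c(u) = {a, b} and c(v) = {b, c} overlap without being nested. Both u and v lie on the undirected cycle through r, u, b, and v, so they are excluded from VT, and the clusters of the remaining nodes ({a}, {b}, {c}, and {a, b, c}) are nested.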
Figure 2. (a) A hybrid phylogeny ℋ and (b) the rooted phylogenetic tree associated with ℋ as described in Proposition 1.

A hierarchy 𝒞 on X is a collection of subsets of X, containing X and all singleton subsets of X, and satisfying the property

    A ∩ B ∈ {∅, A, B} for all A, B ∈ 𝒞.

Observe that the sets in 𝒞 are nested: if they have one or more species in common, then one set is a subset of the other. It is a classical result in phylogenetics that a hierarchy on X is exactly the set of clusters of a rooted phylogenetic X-tree. Given a hybrid phylogeny ℋ, the following result describes a tree that underlies ℋ. Informally speaking, it is the tree obtained by “collapsing” portions of ℋ where hybridization has occurred. This has the potential to give rise to trees that are poorly resolved in places.

Proposition 1. Let ℋ be a hybrid phylogeny on X with node set V. Then the collection 𝒞 = {c(v): v ∈ VT} is a hierarchy on X, in which case there is a rooted phylogenetic X-tree whose set of clusters is 𝒞.

Proof. The proof is by contradiction. Suppose that {c(v): v ∈ VT} is not a hierarchy. By definition, there exist nodes v1, v2 ∈ VT and elements a,b,x ∈ X such that x ∈ c(v1) ∩ c(v2), a ∈ c(v1)−c(v2), and b ∈ c(v2)−c(v1). Because c(v1) is not a subset of c(v2), there is no directed path in ℋ from v2 to v1. Similarly, there is no directed path from v1 to v2. Because x ∈ c(v1) ∩ c(v2), there is a directed path P1 from v1 to x and a directed path P2 from v2 to x. Let v be the first node that is shared by both P1 and P2. Note that such a node exists since x is a node shared by P1 and P2. Because there is no directed path from v1 to v2 or v2 to v1, we know that v ≠ v1 and v ≠ v2. Similarly, there exist directed paths Qi from ρ to vi (for i = 1,2) and we can let w be the last node that is shared by Q1 and Q2.
Again, such a node exists since ρ is shared by both Q1 and Q2. Now if we ignore the direction of the four paths P1, P2, Q1, and Q2, then the path from w to v1 (given by Q1) and w to v2 (given by Q2) and from v1 to v (given by P1) and from v2 to v (given by P2) constitutes an undirected cycle in ℋ, contradicting the assumption that v1, v2 ∈ VT. This completes the proof.

For the hybrid phylogeny shown in Figure 2a, the above construction yields the rooted phylogenetic tree shown in Figure 2b. Here 𝒞 in the statement of Proposition 1 is [displayed set missing in source].

Real-Time Hybrids

Maddison (1997) (see also Moret et al., 2004) pointed out an important biological requirement of hybrid phylogenies. Namely, although a hybrid phylogeny might display two trees, there may be no process of hybridization between contemporaneous taxa (either past or present) that can realize this hybrid phylogeny. Nevertheless, by allowing for additional (unsampled, or perhaps extinct) taxa one can resolve this issue without introducing any additional hybridizations. Essentially the role of such an additional taxon is to “carry” a gene (or combination of genes) from the past into some time when it can be inserted into the new hybrid species. Whether these taxa really are (or were) present is another question, but if we are concerned with just placing lower bounds on the degree of hybridization then we can (conservatively) allow them.

To illustrate this point, consider Figure 3. Both hybrid phylogenies ℋ and ℋ′ display 𝒯1 and 𝒯2 using two hybridization nodes. However, whereas ℋ has a “real-time” realization (see Fig. 4), a concept that will be formalized in the fifth section, ℋ′ has no such realization. To see the latter, observe that the “parents” of the hybrid species b must coexist in time and the parents of the hybrid species c must also coexist in time. Yet, by considering the ancestor-descendant relationships of these parents, this is not possible.
Nevertheless, by allowing another species x that may be either extinct or not yet sampled, one can provide such a realization to ℋ′. This realization is shown as ℋ″ in Figure 4.

Figure 3. Two rooted phylogenetic trees 𝒯1 and 𝒯2 and two hybrid phylogenies ℋ and ℋ′ that display 𝒯1 and 𝒯2.

Figure 4. Two hybrid phylogenies that explain the real-time evolutionary histories of 𝒯1 and 𝒯2 in Figure 3.

In the fifth section we present an algorithm for determining whether a given hybrid phylogeny has a real-time realization, or whether additional taxa (as in ℋ″ in Fig. 4) might be required.

Finding the Minimal Degree of Hybridization

A topical question is: what is the smallest number of reticulation events required to explain a set of gene trees? This number sets a lower bound on the degree of reticulation that has occurred in the evolution of the species under consideration. If this initial set of data is a collection 𝒫 of rooted phylogenetic trees, this problem can be interpreted within the framework of hybrid phylogenies as follows. For a hybrid phylogeny ℋ with node set V and root ρ, set

    h(ℋ) = Σ_{v ∈ V, v ≠ ρ} (d−(v) − 1).

Note that, as d−(v) is the number of parents of v and every node other than the root has exactly one parent if there is no hybridization, d−(v)−1 is the number of “extra parents” that v has. Observe that h(ℋ) ≥ 0, and h(ℋ) = 0 precisely if ℋ is a rooted phylogenetic tree. Extending this definition, the hybrid number of a collection 𝒫 of rooted phylogenetic trees is

    h(𝒫) = min{h(ℋ): ℋ is a hybrid phylogeny that displays 𝒫}.

The value h(𝒫) represents the smallest number of hybridization events that are required to explain 𝒫.
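For a single, given hybrid phylogeny, the count of “extra parents” is a one-line sum over indegrees. The sketch below is illustrative only: the arc-list input and the two example networks are assumptions made here. It evaluates the hybrid number of one network and does not attempt the minimization over all networks that display a collection of trees.

```python
from collections import Counter

def hybrid_number(arcs, root):
    """Sum of (indegree - 1) over all non-root nodes of one hybrid phylogeny."""
    indeg = Counter(v for _, v in arcs)
    # every non-root node of a rooted digraph has indegree >= 1, so it appears in indeg
    return sum(d - 1 for v, d in indeg.items() if v != root)

# Hypothetical network with a single hybrid node b (parents u and v).
hn_network = [("r", "u"), ("r", "v"), ("u", "a"),
              ("u", "b"), ("v", "b"), ("v", "c")]

# Hypothetical rooted tree on the same species.
hn_tree = [("r", "u"), ("r", "c"), ("u", "a"), ("u", "b")]
```

Here hybrid_number(hn_network, "r") is 1 and hybrid_number(hn_tree, "r") is 0, in line with the observation that the value is zero precisely for trees.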
Bordewich and Semple (2005) showed that computing h(𝒫) is NP-hard even in the simplest case, in which 𝒫 consists of just two rooted binary phylogenetic trees on the same leaf set. However, despite this negative result, there are some attractive and useful positive results that have recently been described for computing and bounding h(𝒫). We describe these in the next section.

The Minimum Number of Hybrid Events Required for Two Trees

We begin this section with some further graph-theoretic notation. Let 𝒯 be a rooted binary phylogenetic X-tree and let A be a subset of X. We denote the minimal rooted subtree of 𝒯 that connects the elements in A by 𝒯(A). Furthermore, we use 𝒯|A to denote the rooted subtree that is obtained from 𝒯(A) by suppressing all nodes of indegree one and outdegree one. Now let 𝒯 and 𝒯′ be two rooted binary phylogenetic X-trees. We will write h(𝒯, 𝒯′) to denote h(𝒫) for 𝒫 = {𝒯, 𝒯′}.

The first result we describe shows how one can simplify the calculation of h(𝒯, 𝒯′) when one or more clusters are shared by both 𝒯 and 𝒯′. More precisely, suppose that A ⊂ X is a cluster of both 𝒯 and 𝒯′ (that is, there is a node of each tree that has A as its set of descendants in X). Let 𝒯|A and 𝒯′|A denote the subtrees of 𝒯 and 𝒯′ (respectively) that have leaf set A, and let 𝒯a and 𝒯′a be the rooted trees obtained from 𝒯 and 𝒯′ (respectively) by replacing the subtree having leaf set A with a new leaf a.

Theorem 1. Let 𝒯 and 𝒯′ be two rooted binary phylogenetic X-trees. Suppose that A ⊂ X is a cluster of both 𝒯 and 𝒯′. Then

    h(𝒯, 𝒯′) = h(𝒯|A, 𝒯′|A) + h(𝒯a, 𝒯′a).

The proof of Theorem 1 is given in Appendix 1. This result is typical of other relationships that can be established by exploiting a description of h(𝒯, 𝒯′) in terms of what has recently been called a “good-agreement-forest” for the pair 𝒯 and 𝒯′ (see Baroni et al., 2005).
(“Good” is an overused term, so in this article we will refer to such agreement forests as “acyclic.”) We describe this connection now and provide an application in the next section to show how these results can be used in practice.

To make the interpretation work, we regard the root of both 𝒯 and 𝒯′ as a node ρ that is adjoined to the original root by a new edge. Furthermore, we view ρ as part of the label sets of both 𝒯 and 𝒯′; that is, we view the label sets of 𝒯 and 𝒯′ as X ∪ {ρ}. For example, consider the two rooted binary phylogenetic trees 𝒯 and 𝒯′ shown in the top part of Figure 5. For the purposes of the interpretation, we view 𝒯 and 𝒯′ as shown in the bottom part of Figure 5.

An agreement forest for 𝒯 and 𝒯′ with k+1 components is a collection ℱ = {𝒯ρ, 𝒯1, 𝒯2, …, 𝒯k}, where 𝒯ρ is a rooted tree whose label set ℒρ includes ρ and 𝒯1, 𝒯2, …, 𝒯k are rooted binary phylogenetic trees with label sets ℒ1, ℒ2, …, ℒk such that the following properties are satisfied:

(i) The label sets ℒρ, ℒ1, ℒ2, …, ℒk partition X ∪ {ρ}.
(ii) For all i ∈ {ρ,1,2,…,k}, 𝒯i is the same as (isomorphic to) 𝒯|ℒi and 𝒯′|ℒi.
(iii) The trees in {𝒯(ℒi): i ∈ {ρ,1,2,…,k}} and {𝒯′(ℒi): i ∈ {ρ,1,2,…,k}} are node-disjoint rooted subtrees of 𝒯 and 𝒯′, respectively.

Figure 5. Two rooted binary phylogenetic trees 𝒯 and 𝒯′ without (above) and with (below) their root labeled ρ.

More informally, ℱ is an agreement forest for 𝒯 and 𝒯′ if, up to suppressing degree-two nodes, ℱ can be obtained from each of 𝒯 and 𝒯′ by deleting |ℱ|−1 edges. As an example, the two forests ℱ1 and ℱ2 shown in Figure 6 are both agreement forests for the two trees 𝒯 and 𝒯′ shown in Figure 5.

Figure 6. Two agreement forests for the two rooted binary phylogenetic trees shown in Figure 5.
It has recently been shown (Bordewich and Semple, 2004) that for any two rooted binary phylogenetic trees 𝒯 and 𝒯′ on the same leaf set, the smallest value of k of any agreement forest for 𝒯 and 𝒯′ equals the rooted subtree prune and regraft distance between 𝒯 and 𝒯′. Denoted drSPR(𝒯, 𝒯′), this distance is the minimum number of rooted subtree prune and regraft operations required to transform 𝒯 into 𝒯′. It is tempting to conjecture that drSPR(𝒯, 𝒯′) and h(𝒯, 𝒯′) are identical, and indeed the former takes the value 1 if and only if the latter does. However, drSPR(𝒯, 𝒯′) is only a lower bound for h(𝒯, 𝒯′), and one can construct pairs of trees 𝒯 and 𝒯′ on n species such that drSPR(𝒯, 𝒯′) = 2 yet h(𝒯, 𝒯′) > [term missing in source] − 1 (Baroni et al., 2005).

An agreement forest for 𝒯 and 𝒯′ is a maximum-agreement forest if, amongst all agreement forests for 𝒯 and 𝒯′, it has the smallest number of components. To continue the previous example, it is straightforward to check that the forest ℱ1 in Figure 6 is a maximum-agreement forest for the two trees 𝒯 and 𝒯′ in Figure 5. Thus the rooted subtree prune and regraft distance between these two trees is 2.

For the interpretation of h(𝒯, 𝒯′) in terms of agreement forests, we need one further definition. Let ℱ = {𝒯ρ, 𝒯1, 𝒯2, …, 𝒯k} be an agreement forest for 𝒯 and 𝒯′. Let G_ℱ be the directed graph whose nodes represent the trees in ℱ and for which (𝒯i, 𝒯j) is a directed edge from the node representing 𝒯i to the node representing 𝒯j precisely if i ≠ j and either the root of the subtree 𝒯(ℒi) in 𝒯 is an ancestor of the root of the subtree 𝒯(ℒj) in 𝒯, or the root of the subtree 𝒯′(ℒi) in 𝒯′ is an ancestor of the root of the subtree 𝒯′(ℒj) in 𝒯′. Because ℱ is an agreement forest, the roots of the subtrees 𝒯(ℒi) and 𝒯(ℒj) and the roots of the subtrees 𝒯′(ℒi) and 𝒯′(ℒj) are not the same. We call ℱ an acyclic-agreement forest if G_ℱ is acyclic; that is, if G_ℱ has no directed cycles.
Furthermore, if over all acyclic-agreement forests for 𝒯 and 𝒯′, ℱ contains the smallest number of components, then ℱ is a maximum-acyclic-agreement forest for 𝒯 and 𝒯′, in which case we denote this value of k by mg(𝒯, 𝒯′). Observe that mg(𝒯, 𝒯′) = 0 if and only if, up to isomorphism, 𝒯 and 𝒯′ are identical. The forest ℱ2 in Figure 6 is an acyclic-agreement forest for the two trees 𝒯 and 𝒯′ in Figure 5. Indeed, this forest is a maximum-acyclic-agreement forest for 𝒯 and 𝒯′. To see that ℱ1 is not an acyclic-agreement forest for 𝒯 and 𝒯′, observe that G_ℱ1 contains a directed cycle (see Fig. 7, where the nodes are drawn as large circles enclosing the trees they represent).

Figure 7. The graph G_ℱ1.

The interpretation of the hybrid number of two rooted binary phylogenetic trees on the same label set in terms of agreement forests is stated in the following theorem, which was established by Baroni et al. (2005).

Theorem 2. Let 𝒯 and 𝒯′ be two rooted binary phylogenetic X-trees. Then

    h(𝒯, 𝒯′) = mg(𝒯, 𝒯′).

For example, it follows from Theorem 2 that the value of h(𝒯, 𝒯′) for the two trees in Figure 5 is 3.

We mentioned previously that computing h(𝒯, 𝒯′) is NP-hard. The reason for this is that finding a maximum-acyclic-agreement forest for 𝒯 and 𝒯′ is NP-hard. Currently, the best known method for finding such a forest is trial and error. However, if one has an acyclic-agreement forest ℱ (not necessarily maximum) for 𝒯 and 𝒯′, then there is a simple algorithm using ℱ for constructing a hybrid phylogeny that displays both 𝒯 and 𝒯′. This algorithm is provided by the inductive proof of Theorem 2 in Baroni et al. (2005) and is given below.

There is a simple, fast, and well-known way of deciding whether or not a directed graph D is acyclic. Find a node, v1 say, that has indegree zero. If there is no such node, then D contains a directed cycle. Now delete v1 (and all arcs incident with v1) from D, and find a node, v2 say, that has indegree zero in the resulting digraph.
Again, if there is no such node, D contains a directed cycle. Deleting v2 and continuing in this way, we eventually find that D is not acyclic or obtain an ordering of the nodes, v1,v2,…,vn say, of D, so that for all i ∈ {1,2,…,n}, the node vi has indegree zero in the digraph obtained from D by deleting the nodes v1,v2,…,vi−1 and all edges incident with these nodes. This ordering implies that D is acyclic (see Lemma 1 in Appendix 1). Consequently, we will call such an ordering an acyclic ordering of D. We remark here that this process is formally incorporated in the algorithm given in the fifth section.

The algorithm for constructing a hybrid phylogeny from an acyclic-agreement forest ℱ is as follows. Note that, in any acyclic ordering of G_ℱ, the node 𝒯ρ always appears first.

Algorithm: HybridPhylogeny(ℱ)
Input: An acyclic-agreement forest ℱ for two rooted binary phylogenetic X-trees 𝒯 and 𝒯′ with k+1 components.
Output: A hybrid phylogeny ℋ that displays both 𝒯 and 𝒯′ in which the number of hybridization nodes is k.
1. Find an acyclic ordering, 𝒯ρ, 𝒯1, 𝒯2, …, 𝒯k say, of G_ℱ.
2. Set ℋ0 = 𝒯ρ and set i = 1.
3. Attach 𝒯i to ℋi−1 via two new edges. Each of these edges joins the root of 𝒯i to some (not necessarily distinct) edge of ℋi−1. These edges are added so that the resulting hybrid phylogeny displays 𝒯|ℒ({𝒯ρ, 𝒯1, …, 𝒯i}) and 𝒯′|ℒ({𝒯ρ, 𝒯1, …, 𝒯i}).
4. Set ℋi to be the resulting hybrid phylogeny, and return ℋi if i = k.
5. Increment i by 1 and go to Step 3.

Application

In this section, we apply the theory of the last section to two phylogenetic trees on 46 sequences of alpine Ranunculi of New Zealand, reported by Lockhart et al. (2001). The first tree was constructed from nuclear ITS sequences, whereas the second was constructed from chloroplast (JSA) sequences (for details see Lockhart et al., 2001). The two trees showed considerable agreement; however, there was also a fair degree of incompatibility.
One possible explanation for this incompatibility is the occurrence of hybrid evolution, whereby the nuclear ITS sequence has a different history to the chloroplast (JSA) sequences. Of course, there may be other sources of phylogenetic error (sampling effects such as noise, model misspecification, lineage sorting) that could cause the two trees to conflict, even in the absence of any hybrid evolution. Nevertheless, we can still ask the following question: Assuming the two trees correctly describe the history of the two genes, and their incongruence is due to hybrid evolution, what is the smallest number of hybrid events required to explain this?

The study is complicated slightly by the fact that neither tree is binary. In this case, we took a conservative approach and allowed nonbinary subtrees to be resolved in any way that helped minimize the required number of hybridization events. Also, for the sake of illustration in this article, we will restrict attention to a subgroup (“Group I”) consisting of 20 sequences. This group is a candidate for reticulate evolution, since the F1 progeny of hybrid origin are known to be fertile (Fisher, 1965). The two trees for these 20 sequences are shown in Figure 8, with 𝒯1 the nuclear tree and 𝒯2 the chloroplast tree.

Figure 8. The tree 𝒯1 for nuclear ITS sequences and 𝒯2 for chloroplast JSA sequences from Lockhart et al. (2001) restricted to Group I.

For 𝒯1 and 𝒯2, one can identify five clusters (denoted l1 to l5 in Fig. 8) shared by these two trees; this allows us to apply Theorem 1.
In this way we reduce the problem from comparing two 20-taxon trees to one of comparing two 5-taxon trees (each having leaf set l1, …, l5), together with the trees on the shared clusters (in fact, these latter trees do not contribute to the h score, because all these pairs of cluster subtrees are compatible). Now, using Theorem 2 and a detailed case analysis, one can show that h(𝒯1, 𝒯2) = 3. Figure 9 shows one hybrid phylogeny (ℋ) that displays the five clusters shared by 𝒯1 and 𝒯2 with three hybrid events. Note that this is not the only such phylogeny. Similarly, for the full set of 46 sequences it can be shown (by hand) that the h value lies between 7 and 12 (Baroni, 2004). Thus, assuming the trees are correct, we require at least 3 hybrid events to describe the evolution of the Group I sequences, and at least 7 hybrid events to describe the evolution of the entire group of 46 sequences. We should stress that this analysis is to illustrate the techniques, rather than to formally show that there has been this degree of hybrid evolution in the taxa described. As we mentioned, there are other reasons why trees may disagree, and these need to be considered (these other processes often leave different statistical signatures from hybridization; see Holder et al., 2001; Huson et al., 2005).

Figure 9. Two hybrid phylogenies that display 𝒯1 and 𝒯2 and require three hybridization events (the fewest possible for these two trees).

Using an argument similar to that used to show that ℋ′ in Figure 3 has no real-time realization (in the sense described earlier), it is easily checked that the hybrid phylogeny ℋ shown in Figure 9 also has no real-time realization. However, the hybrid phylogeny ℋ′ in Figure 9 allows for a real-time hybrid evolution scenario, with just two extra taxa y1 and y2.
Although the question of whether a real-time realization exists could be resolved for this small-scale example by an ad hoc case analysis, it is clear that such a task could be complicated for a large and complex hybrid phylogeny. In the next section, we present an algorithm to determine whether an arbitrary hybrid phylogeny can be realized by hybrid evolution between contemporaneous ancestral taxa.

An Algorithm for Real-Time Hybrids

The concept of a real-time hybrid has been briefly and informally mentioned already; now we formalize this notion and provide an algorithm to determine whether an arbitrary hybrid phylogeny can be realized in this way. Some of the more technical parts of this section have been moved to Appendix 1 to assist readability.

Let ℋ be a hybrid phylogeny with node set V and arc set A. We say that ℋ has a temporal representation if there exists a map f: V → ℕ = {0,1,2,…} with the following properties:

(i) If v is a node of ℋ with d−(v) = 1, then f(u) < f(v) for the (unique) immediate ancestor u of v.
(ii) If v is a node of ℋ with d−(v) ≥ 2, then f(u) = f(v) for all immediate ancestors u of v.

Such a map is called a temporal labeling of ℋ. To illustrate, a temporal labeling of a hybrid phylogeny is shown in Figure 10, where, for each node, the first element is the node and the second element is the element of ℕ assigned under the temporal representation f. All rooted phylogenetic trees have a temporal representation. However, not all hybrid phylogenies have such a representation. For example, the hybrid phylogeny shown in Figure 11a, which has the same shape as ℋ′ shown in Figure 3, has no temporal representation.

Figure 10. A temporal labeling of a hybrid phylogeny.

Figure 11. (a) A hybrid phylogeny ℋ2 with no temporal representation and (b) its associated digraph D_ℋ2.
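The two properties of a temporal labeling can be checked directly, arc by arc. The sketch below is a minimal illustration under assumptions made here: the network is an arc list, the candidate labeling is a dictionary of natural numbers, and the example network is hypothetical rather than the one in Figure 10.

```python
from collections import defaultdict

def is_temporal_labeling(arcs, f):
    """Check both defining properties of a temporal labeling f (node -> int)."""
    parents = defaultdict(list)
    for u, v in arcs:
        parents[v].append(u)
    for v, ps in parents.items():
        if len(ps) == 1 and not f[ps[0]] < f[v]:
            return False          # a tree node must be strictly later than its parent
        if len(ps) >= 2 and any(f[u] != f[v] for u in ps):
            return False          # a hybrid node must coexist with all of its parents
    return True

# Hypothetical network: b is a hybrid of u and v.
tl_arcs = [("r", "u"), ("r", "v"), ("u", "a"),
           ("u", "b"), ("v", "b"), ("v", "c")]
good_f = {"r": 0, "u": 1, "v": 1, "b": 1, "a": 2, "c": 2}
bad_f = {"r": 0, "u": 1, "v": 2, "b": 2, "a": 2, "c": 3}
```

The labeling good_f passes, whereas bad_f fails because the parent u and the hybrid b receive different values.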
The main result of this section (Theorem 3) is to characterize exactly when an arbitrary hybrid phylogeny has a temporal representation. To this end, we next describe a particular digraph D_ℋ associated with a fixed hybrid phylogeny ℋ with node set V. This graph is not designed to depict the evolutionary relationships; instead it summarizes properties of ℋ. The node set for this new graph will be denoted [V] and will consist of nodes [v] that represent either a single node in ℋ, or a subset of nodes in ℋ that must have been contemporaneous (because they are nodes involved in the same hybridization event, as parental species or as the child species).

In particular, let V and A be the node and arc sets of ℋ, respectively. Let

    AT = {(u,v) ∈ A: d−(v) = 1}

and

    AH = {(u,v) ∈ A: d−(v) ≥ 2}.

Any arc in AT is called a tree arc and any arc in AH is called a hybridization arc. Note that the sets AT and AH partition A. Ignoring the direction of the arcs of ℋ, an equivalence relation ∼ on V is now defined by setting u ∼ v if there is a path of hybridization arcs from u to v in ℋ. Observe that if v is not incident with a hybridization arc, then [v] = {v}. Set

    [V] = {[v]: v ∈ V}.

We describe our associated digraph D_ℋ as follows. The node set of D_ℋ is [V], and [u] and [v] are joined by an arc ([u],[v]) if there exist a ∈ [u] and b ∈ [v] such that (a,b) is a tree arc in A. It is easily seen that D_ℋ is connected. To illustrate, consider Figures 11 and 12. Figure 11b shows the digraph D_ℋ2, where ℋ2 is shown in Figure 11a with [displayed equivalence classes missing in source]. Furthermore, for the hybrid phylogeny ℋ1 shown in Figure 12a, the digraph D_ℋ1 is shown in Figure 12b. To provide some intuition for D_ℋ, we note that the arcs in D_ℋ represent the direction of time. Thus a directed cycle means that a descendant species is older than its ancestors, which is not possible.

Figure 12. (a) A hybrid phylogeny ℋ1 and (b) its associated digraph D_ℋ1.
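The construction of the associated digraph can be sketched as follows. This is an illustration under assumptions made here, not the authors' implementation: classes of contemporaneous nodes are found with a small union-find structure over the hybridization arcs, and the function name and example network are hypothetical.

```python
from collections import Counter

def associated_digraph(nodes, arcs):
    """Return (classes, class_arcs) for the associated digraph of a hybrid phylogeny."""
    indeg = Counter(v for _, v in arcs)
    rep = {v: v for v in nodes}            # union-find representative pointers

    def find(v):
        while rep[v] != v:
            rep[v] = rep[rep[v]]           # path halving
            v = rep[v]
        return v

    for u, v in arcs:
        if indeg[v] >= 2:                  # hybridization arc: u and v are equivalent
            rep[find(u)] = find(v)

    members = {}
    for v in nodes:
        members.setdefault(find(v), set()).add(v)
    label = {v: frozenset(members[find(v)]) for v in nodes}

    # tree arcs (head has indegree one) induce the arcs between classes
    class_arcs = {(label[u], label[v]) for u, v in arcs if indeg[v] == 1}
    return set(label.values()), class_arcs

# Hypothetical network: b is a hybrid of u and v.
ad_nodes = ["r", "u", "v", "a", "b", "c"]
ad_arcs = [("r", "u"), ("r", "v"), ("u", "a"),
           ("u", "b"), ("v", "b"), ("v", "c")]
ad_classes, ad_class_arcs = associated_digraph(ad_nodes, ad_arcs)
```

For this example the classes are {r}, {u, v, b}, {a}, and {c}, and the class {u, v, b} receives one arc from {r} and sends one arc to each of {a} and {c}. A tree arc joining two nodes of the same class would show up as a loop, which is a directed cycle of length one.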
Let $\mathcal{H}$ be a hybrid phylogeny and suppose that $f: V \to \mathbb{N}$ is a temporal labeling of $\mathcal{H}$. Let $\bar{f}$ be the map from $[V]$ to $\mathbb{N}$ that is defined by setting $\bar{f}([v]) = f(v)$ for all $v \in V$. To see that this map is well defined, first observe that if $[u] = [v]$, then there is an (undirected) path from $u$ to $v$ consisting of hybridization arcs. Because the end nodes of any arc on this path are assigned the same natural number under $f$, it follows that all nodes on this path are assigned the same natural number under $f$. Hence, for all $w, w' \in [v]$, we have $f(w) = f(w')$. Thus $\bar{f}$ is well defined. Moreover, as $f$ is a temporal labeling of $\mathcal{H}$, there are no $u$ and $v$ in the same equivalence class such that $(u,v)$ is a tree arc. The following result provides a concise characterization of when a hybrid phylogeny has a temporal representation; its proof is given in Appendix 1.
Theorem 3. A hybrid phylogeny $\mathcal{H}$ has a temporal representation if and only if $D_{\mathcal{H}}$ is acyclic.
Theorem 3 is the basis for a polynomial-time algorithm (TempRep) for determining whether or not a hybrid phylogeny has a temporal representation and, if so, providing a temporal labeling.
Algorithm: TempRep($\mathcal{H}$)
Input: A hybrid phylogeny $\mathcal{H}$ with node set $V$.
Output: A temporal labeling $f$ of $\mathcal{H}$ or the statement "$\mathcal{H}$ has no temporal representation".
1. Construct $D_{\mathcal{H}}$.
2. Set $i = 0$ and $D_0 = D_{\mathcal{H}}$.
3. Choose $S_i$ to be any non-empty set of nodes of $D_i$ that have indegree zero. If there are no such nodes, then halt and return "$\mathcal{H}$ has no temporal representation".
4. Set $D_{i+1}$ to be the digraph obtained from $D_i$ by deleting the nodes in $S_i$ and all arcs incident with these nodes. If $D_{i+1}$ is the empty graph, then go to Step 5. Otherwise, increment $i$ by 1 and go to Step 3.
5. Define $f: V \to \mathbb{N}$ by setting $f(v) = i$ for all $v \in V$, where $[v] \in S_i$. Return the map $f$.
The correctness of this algorithm is guaranteed by the following result, whose proof is given in Appendix 1.
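The five steps of TempRep can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `temp_rep`, the arc-list input format, the use of `None` for the "no temporal representation" outcome, and both example networks are hypothetical. At each round the sketch takes $S_i$ to be all nodes of indegree zero, which is one of the choices that Step 3 allows.

```python
from collections import defaultdict


def temp_rep(nodes, arcs):
    """Return a temporal labeling as a dict, or None if none exists."""
    indegree = defaultdict(int)
    for _, v in arcs:
        indegree[v] += 1

    # Step 1: build the digraph on equivalence classes, each class being
    # represented by a union-find leader.
    leader = {v: v for v in nodes}

    def find(x):
        while leader[x] != x:
            leader[x] = leader[leader[x]]
            x = leader[x]
        return x

    for u, v in arcs:
        if indegree[v] >= 2:                  # hybridization arc
            leader[find(u)] = find(v)
    d_arcs = {(find(u), find(v)) for u, v in arcs if indegree[v] == 1}

    # Step 2: D_0 is the whole digraph.
    remaining = {find(v) for v in nodes}
    level = {}
    i = 0
    while remaining:
        # Step 3: S_i = all remaining nodes with indegree zero.
        has_parent = {b for a, b in d_arcs if a in remaining and b in remaining}
        s_i = remaining - has_parent
        if not s_i:
            return None                       # a directed cycle blocks progress
        for c in s_i:
            level[c] = i
        # Step 4: delete S_i and move to the next round.
        remaining -= s_i
        i += 1

    # Step 5: every node inherits the round in which its class was removed.
    return {v: level[find(v)] for v in nodes}


# Hypothetical network with a temporal representation (h is a hybrid of a, b).
NODES = ["r", "a", "b", "h", "x", "y", "z"]
ARCS = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
        ("a", "x"), ("b", "y"), ("h", "z")]

# Hypothetical network without one: parent b of the hybrid h descends from
# the other parent a by a tree arc, so a and b cannot be contemporaneous.
NODES_BAD = ["r", "a", "b", "h", "x", "y", "z"]
ARCS_BAD = [("r", "a"), ("r", "x"), ("a", "b"), ("a", "h"),
            ("b", "h"), ("b", "y"), ("h", "z")]

print(temp_rep(NODES, ARCS))
print(temp_rep(NODES_BAD, ARCS_BAD))
```

Each round rescans the arc set, so this sketch makes no attempt to beat the quadratic bound discussed in the text.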
Theorem 4. Let $\mathcal{H}$ be a hybrid phylogeny, and suppose that TempRep is applied to $\mathcal{H}$.
(i) If $\mathcal{H}$ has a temporal representation, then TempRep returns a temporal labeling of $\mathcal{H}$.
(ii) If $\mathcal{H}$ has no temporal representation, then TempRep returns the statement "$\mathcal{H}$ has no temporal representation".
Moreover, the running time of TempRep is quadratic in the size of the node set of $\mathcal{H}$.
For example, if one takes the hybrid phylogeny $\mathcal{H}_1$ in Figure 12a and applies the algorithm TempRep, one can reconstruct the temporal representation shown in Figure 10. Note that there is some choice as to the assignment of numbers for the leaves a and d. Such choices will generally arise for any hybrid phylogeny that has a temporal representation. Observe that it is the relative ordering of the nodes, and not the actual values assigned by a temporal labeling, that is important. We can make this idea more precise as follows. Let $\mathcal{H}$ be a hybrid phylogeny with node set $V$ that has a temporal representation, and let $f_1$ and $f_2$ be two temporal labelings of $\mathcal{H}$. We say that $f_1$ and $f_2$ are ordering isomorphic if, for all $u, v \in V$, the following hold:
(i) $f_1(u) < f_1(v)$ if and only if $f_2(u) < f_2(v)$;
(ii) $f_1(u) = f_1(v)$ if and only if $f_2(u) = f_2(v)$.
Using the results in this section (and Appendix 1), one can construct an algorithm that lists, up to ordering isomorphism, all temporal labelings of $\mathcal{H}$ so that each such labeling is output in polynomial time. An outline of this algorithm is given in Appendix 1. It is important to note that, as this list may be exponential in the size of $V$, the algorithm itself is not guaranteed to run in polynomial time. We end this section by noting that, although a hybrid phylogeny may have a temporal labeling, this does not mean that unsampled lineages could not have been involved in a hybridization event.
Concluding Remarks
The reconstruction and analysis of hybrid phylogenies gives rise to many challenging mathematical and computational problems.
We have described some results that can help set lower bounds on the extent of hybridization required to explain the conflict between two phylogenetic trees. This is currently an active area of research in bioinformatics (see, e.g., Huynh et al., 2005; Huson et al., 2005). Ultimately, statistical questions will also need to be addressed. For example, how can one use differing bootstrap (or Bayesian posterior probability) support values for different trees to quantify and distinguish genuine reticulate evolution from other phenomena (e.g., lineage sorting) that can give rise to conflicting phylogenies? In classical phylogenetic analysis on trees, a combinatorial analysis often lays the foundation for later statistical approaches (for example, Peter Buneman's work in the early 1970s concerning the four-point condition provided a basis for now widely used distance-based approaches in phylogenetics, such as neighbor-joining with model-corrected distances). Combinatorial insights into hybrid phylogenies are likely also to help in developing statistically based approaches to the study of reticulate evolution. We have also explored the issue of determining whether a hybrid phylogeny has a real-time realization, and provided a simple characterization (and algorithm) for this task. This algorithm runs in polynomial time: a naive implementation has a running time that is quadratic in the number of nodes, though a cleverer implementation might improve on this. Lastly, in general, a hybrid phylogeny on X that displays a collection of rooted binary phylogenetic X-trees need not be unique. Deciding whether there exists such a hybrid phylogeny is an interesting question and one that may have an attractive combinatorial solution.
Acknowledgments
We thank the New Zealand Marsden Fund (UOC-MIS-005) and the Allan Wilson Centre for Molecular Ecology and Evolution for supporting this research.
We thank Peter Lockhart for encouragement to develop some of the theory in this article, and for helpful subsequent comments. We also thank the referees for their helpful and constructive comments. A special thank you to Mark Holder (referee) for his detailed comments.
References
Bafna V., Bansal V. 2004. The number of recombination events in a sample history: Conflict graph and lower bounds. IEEE/ACM Trans. Comput. Biol. Bioinf. 1:78–90.
Bang-Jensen J., Gutin G. 2001. Digraphs: Theory, algorithms and applications. Springer-Verlag, London.
Baroni M. 2004. Hybrid phylogenies: A graph-based approach to represent reticulate evolution. PhD thesis, University of Canterbury, Christchurch, New Zealand.
Baroni M., Grünewald S., Moulton V., Semple C. 2005. Bounding the number of hybridisation events for a consistent evolutionary history. J. Math. Biol. 51:171–182.
Baroni M., Semple C., Steel M. A. 2004. A framework for representing reticulate evolution. Ann. Combin. 8:391–408.
Bordewich M., Semple C. 2004. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Combin. 8:409–423.
Bordewich M., Semple C. 2005. Computing the minimum number of hybridisation events for a consistent evolutionary history. Submitted manuscript (Departmental report UCDMS2004/21). Department of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand.
Fisher F. J. F. 1965. The alpine Ranunculi of New Zealand. DSIR Publishing, New Zealand.
Gusfield D. 2004.
A fundamental decomposition theory for phylogenetic networks and incompatible characters. Technical Report UCDavis CSE-2004-32.
Gusfield D., Eddhu S., Langley C. 2004. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J. Bioinf. Comput. Biol. 2:173–213.
Holder M. T., Anderson J. A., Holloway A. K. 2001. Difficulties in detecting hybridization. Syst. Biol. 50:978–982.
Holland B., Huber K., Moulton V., Lockhart P. J. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. 21:1459–1461.
Huson D. H., Dezulian T., Kloepper T., Steel M. A. 2004. Phylogenetic super-networks from partial trees. IEEE Trans. Comput. Biol. Bioinf. 1:151–158.
Huson D. H., Kloepper T., Lockhart P. J., Steel M. A. 2005. Reconstruction of reticulate networks from gene trees. Pages 233–249 in Proceedings of RECOMB 2005 (Miyano S., ed.). LNBI 3500. Springer-Verlag, Berlin.
Huynh T. N. D., Jansson J., Nguyen N. B., Sung W.-K. 2005. Constructing a smallest refining galled phylogenetic network. Pages 265–280 in Proceedings of RECOMB 2005 (Miyano S., ed.). LNBI 3500. Springer-Verlag, Berlin.
Legendre P., Makarenkov V. 2002. Reconstruction of biogeographic and evolutionary networks using reticulograms. Syst. Biol. 51:199–216.
Lockhart P. J., McLenachan P. A., Havell D., Glenny D., Huson D., Jensen U. 2001. Phylogeny, radiation, and transoceanic dispersal of New Zealand alpine buttercups: Molecular evidence under split decomposition. Ann. Missouri Bot. Gard. 88:458–477.
Maddison W. P. 1997. Gene trees in species trees. Syst. Biol. 46:523–536.
Moret B. M. E., Nakhleh L., Warnow T., Linder C. R., Tholse A., Padolina A., Sun J., Timme R. 2004. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Trans. Comput. Biol. Bioinf. 1:1–11.
Song Y., Hein J. 2004. On the minimum number of recombination events in the evolutionary history of DNA sequences. J. Math. Biol. 48:160–186.
Appendix 1
Proof of Theorem 1
It is clear that the inequality holds if A = X. Therefore, we may assume that A≠ X. We first show that Let A be a maximum-acyclic-agreement forest for |A and ′|A, and let a be a maximum-acyclic-agreement forest for |a and ′|a. Let i,a be the unique tree in a with a node labeled a, and let ρ,A be the unique tree in A with a node labeled ρ. Let A,a be the tree obtained by adjoining ρ,A to i,a via an edge joining ρ and a, removing the labels ρ and a, and then suppressing any degree-two nodes. Because of the acyclic conditions on A and a, we have that is an acyclic-agreement forest for and ′ with || = |A|+|a|−1. It now follows by Theorem 2 that This establishes (1). We next show that Let be a maximum-acyclic-agreement forest for and ′. There are two cases to consider: (i) there exists i∈ such that (i)∩ A≠ ∅ and (i)∩ ((X−A)∪{ρ})≠ ∅, and (ii) for all i∈ , either (i)⊆ A or (i)⊆ ((X−A)∪{ρ}).
Case (i). Assume that i is such a tree in . Then the minimal subtree of (and ′) that contains the label set of i includes the root of |A (and ′|A). Because is an agreement forest, this implies that i is the unique tree in with the properties described in (i). Let x∈ (i)∩ A, and let i,a be the tree obtained from i| ((X−A)∪ {ρ}∪ {x}) by relabeling x as a. Furthermore, let i,A be the tree obtained from i|A by adding ρ at the end of a pendant edge adjoined to the root of i|A.
Then, as is an acyclic-agreement forest for and ′, is an acyclic-agreement forest for |A and ′|A, and is an acyclic-agreement forest for a and ′a. Since || = |A|+|a|−1, we have that This establishes (2) for (i).
Case (ii). Because does not contain any directed cycles, it follows that the sub-digraph of induced by the set {i ∈ :(i)⊆ A} does not contain any directed cycles. This means that this sub-digraph has a node, 0 say, of indegree zero. Let 0,ρ be the tree obtained from 0 by adding ρ at the end of a pendant edge adjoined to the root of 0. Since is an acyclic-agreement forest for and ′, it is easily seen that is an acyclic-agreement forest for |A and ′|A, and is an acyclic-agreement forest for a and ′a, where a is used to denote the tree consisting of a single node labeled a. Thus || = |A|+|a|−1, and so, by Theorem 2, This establishes (2) for (ii). Combining (1) and (2) completes the proof of the theorem.
Proof of Theorem 3
Let $D$ be a digraph with node set $V$ and arc set $A$, and suppose that $D$ is acyclic. In an earlier section, we described the concept of an acyclic ordering of $D$. It is easily seen that this is equivalent to there being a map $g: V \to \mathbb{N}$ such that, for all $(u,v) \in A$, we have $g(u) < g(v)$. Such a map $g$ will prove useful in proving Theorem 3. The following lemma is well known and easily proved (for example, see Bang-Jensen and Gutin, 2001).
Lemma 1. A digraph is acyclic if and only if it has an acyclic ordering.
Proposition 2. Let $\mathcal{H}$ be a hybrid phylogeny with node set $V$ and suppose that $f: V \to \mathbb{N}$ is a temporal labeling of $\mathcal{H}$. Then the map $\bar{f}: [V] \to \mathbb{N}$ defined by $\bar{f}([v]) = f(v)$ induces an acyclic ordering of $[V]$. In particular, $D_{\mathcal{H}}$ is acyclic.
Proof. Let $f: V \to \mathbb{N}$ be a temporal labeling of $\mathcal{H}$, and consider $D_{\mathcal{H}}$. Let $([u],[v])$ be an arc of $D_{\mathcal{H}}$. To prove the proposition it suffices to show, by Lemma 1, that $\bar{f}([u]) < \bar{f}([v])$. Now, by definition, there exist elements $a \in [u]$ and $b \in [v]$ such that $(a,b)$ is a tree arc of $\mathcal{H}$.
Because $f$ is a temporal labeling of $\mathcal{H}$, we have that $f(a) < f(b)$, which in turn implies that $\bar{f}([u]) = f(a) < f(b) = \bar{f}([v])$, as required.
Proposition 3. Let $\mathcal{H}$ be a hybrid phylogeny with node set $V$, and suppose that $D_{\mathcal{H}}$ is acyclic. Let $g$ be an acyclic ordering of $[V]$. Let $f$ be the map from $V$ into $\mathbb{N}$ defined by setting $f(v) = g([v])$. Then $f$ is a temporal labeling of $\mathcal{H}$.
Proof. Let $(u,v)$ be an arc of $\mathcal{H}$. First assume that $(u,v)$ is a tree arc. Then $u$ and $v$ are in different equivalence classes; otherwise, $D_{\mathcal{H}}$ contains a loop, contradicting the fact that $D_{\mathcal{H}}$ is acyclic. Furthermore, there is an arc from $[u]$ to $[v]$ in $D_{\mathcal{H}}$, and so $g([u]) < g([v])$. It now follows that $f(u) < f(v)$. Now assume that $(u,v)$ is a hybridization arc of $\mathcal{H}$. Then $[u] = [v]$, and so $f(u) = f(v)$. Hence, by definition, $f$ is a temporal labeling of $\mathcal{H}$.
Combining Propositions 2 and 3, we obtain Theorem 3.
Proof of Theorem 4
To see that TempRep does indeed work, we begin with the following well-known and easily proved lemma.
Lemma 2. Let $D$ be a digraph that contains no directed cycle. Then there exists a node of $D$ whose indegree is zero.
To prove part (i) of Theorem 4, suppose that $\mathcal{H}$ has a temporal representation. Then, by Theorem 3, $D_{\mathcal{H}}$ has no directed cycles. By Lemma 2, this implies that every subdigraph obtained from $D_{\mathcal{H}}$ by deleting nodes (and their incident arcs) contains at least one node of indegree zero. It now follows that TempRep applied to $\mathcal{H}$ returns a map $f: V \to \mathbb{N}$. To see that $f$ is a temporal labeling of $\mathcal{H}$, define $g: [V] \to \mathbb{N}$ by setting $g([v]) = i$, where $[v] \in S_i$. Because of the way in which $S_0, S_1, S_2, \ldots$ are constructed, $g$ is an acyclic ordering of the nodes of $D_{\mathcal{H}}$. Therefore, by Proposition 3, the map $f$ is a temporal labeling of $\mathcal{H}$. For the proof of part (ii) of Theorem 4, suppose that $\mathcal{H}$ has no temporal representation. Then, by Theorem 3, $D_{\mathcal{H}}$ contains a directed cycle. Let $\{[v_1],[v_2],\ldots,[v_k]\}$ be the nodes in this cycle, where we may assume that $([v_j],[v_{j+1}])$ for all $j$ and $([v_k],[v_1])$ are arcs of this cycle.
It is now easily seen that, beginning with $D_{\mathcal{H}}$ and selecting and removing only nodes of indegree zero, none of the nodes in this cycle can ever be removed. Thus at some iteration $i$ of TempRep applied to $\mathcal{H}$, no node of $D_i$ has indegree zero, in which case TempRep halts and returns "$\mathcal{H}$ has no temporal representation". This completes the proof of (ii). We leave to the reader the straightforward check that the running time of TempRep applied to $\mathcal{H}$ is quadratic in the size of the node set of $\mathcal{H}$.
Outline of an Algorithm to Output All Temporal Labelings of a Hybrid Phylogeny, Up to Ordering Isomorphism
By Proposition 2, each temporal labeling of $\mathcal{H}$ induces an acyclic ordering of the node set $[V]$ of $D_{\mathcal{H}}$. Conversely, by Proposition 3, each acyclic ordering of $[V]$ induces a temporal labeling of $\mathcal{H}$. It follows that if $\mathcal{H}$ has a temporal representation, then all temporal labelings of $\mathcal{H}$ can be found by finding all acyclic orderings of $[V]$. Using the first part of the proof of Theorem 3, it is easily checked that all such orderings can be obtained by considering all possible ways of reducing $D_{\mathcal{H}}$ to the empty graph by sequentially selecting and then deleting subsets of nodes of indegree zero. Because only the relative ordering of the nodes of $\mathcal{H}$ is of interest, only the order in which these subsets are chosen is important. Each possible way of reducing $D_{\mathcal{H}}$ to the empty graph gives rise to a unique sequence of chosen subsets of nodes of $D_{\mathcal{H}}$. In TempRep, this corresponds to all possible choices for the sequence $S_0, S_1, S_2, \ldots$. Furthermore, each such sequence induces, up to ordering isomorphism, a unique temporal labeling of $\mathcal{H}$. Hence, to list, up to ordering isomorphism, all temporal labelings of $\mathcal{H}$, one simply needs to systematically find all possible choices for selecting $S_0, S_1, S_2, \ldots$ in TempRep.
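The enumeration outlined above can be illustrated by a short recursive generator. This is a hedged Python sketch rather than the authors' procedure: the name `all_layerings`, the input format (the nodes and arcs of the associated digraph, with sortable node names), and the three-node example are assumptions. Each yielded dictionary assigns a round number to every node of the digraph; the label of a node $v$ of the hybrid phylogeny is then the number assigned to $[v]$.

```python
from itertools import combinations


def all_layerings(d_nodes, d_arcs):
    """Yield one labeling per sequence S_0, S_1, ... of non-empty sets of
    indegree-zero nodes that reduces the digraph to the empty graph."""

    def extend(remaining, i, labels):
        if not remaining:
            yield labels
            return
        # Nodes that still have an incoming arc from a remaining node.
        targets = {b for a, b in d_arcs if a in remaining and b in remaining}
        sources = sorted(remaining - targets)
        # Try every non-empty subset of the current indegree-zero nodes as S_i.
        # If sources is empty (a directed cycle), nothing is yielded.
        for k in range(1, len(sources) + 1):
            for s_i in combinations(sources, k):
                new_labels = {**labels, **{c: i for c in s_i}}
                yield from extend(remaining - set(s_i), i + 1, new_labels)

    yield from extend(frozenset(d_nodes), 0, {})


# Hypothetical digraph: p precedes r, and q is unrelated to both.
D_NODES = {"p", "q", "r"}
D_ARCS = {("p", "r")}
LABELINGS = list(all_layerings(D_NODES, D_ARCS))
for labeling in LABELINGS:
    print(labeling)
```

As the text notes, the number of labelings can grow exponentially, so a generator of this kind bounds only the work per labeling, not the total running time.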
© 2006 Society of Systematic Biologists This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
TI - Hybrids in Real Time
JF - Systematic Biology
DO - 10.1080/10635150500431197
DA - 2006-02-01
UR - https://www.deepdyve.com/lp/oxford-university-press/hybrids-in-real-time-pGN8JoHN00
SP - 46
EP - 56
VL - 55
IS - 1
DP - DeepDyve
ER -