I’m working on building trees from large sequence sets. To start with, I want to make guide trees for MUSCLE; long-term I want to make estimates of phylogeny. One approach is to USEARCH as a subroutine that finds the nearest neighbors of a sequence or a group of sequences.
Short-term, I want to make guide trees for MUSCLE for large sequence sets. Building the guide tree is currently the most expensive step in both time and memory, so building a tree more efficiently will allow much larger sets to be aligned.
I’m tackling this first because (i) MUSCLE doesn’t care (much) about branch lengths, so this simplifies the problem, and (ii) I often get requests for people wanting to align large numbers of sequences (usually not recommended; see this post).
I’ve implemented an algorithm I call “middle-out” hierarchical clustering (as opposed to top-down or bottom-up). It works like this. Set a target number of sequences that are tractable for all-vs-all distance clustering by UPGMA, say 1000. Start by using UCLUST at some medium identity, say 80% for nucleotides. If you have too many seeds, repeatedly cluster at lower identities until you reduce the number of seeds to the desired target. Then, for each cluster, repeatedly cluster at higher identities until you reduce the number of seeds to the desired target. Conceptually, this builds the middle of the tree, the top and bottom levels are done by a more careful method.
This builds a tree, but now the question is how good is it? One approach is to compare with a phylogenetic tree built by a more robust method, say FastTree. This has a couple of problems. In general, the optimal guide tree is not the same as the phylogenetic tree (see the second MUSCLE paper for the reasoning). Also, the metrics used for tree comparison are usually topological, i.e. consider only the branching order, ignoring branch lengths. At first glance, this is good because MUSCLE doesn’t care about branch lengths, but…
Topological metrics like Robinson-Foulds don’t distinguish cases where branching order is important from cases where it is not. Suppose e.g. there are three identical sequences. The branching order is then arbitrary, and then R-F can ding you for no good reason.
This could be fixed by using a metric that is aware of branch lengths. But MUSCLE only uses the branching order, so a tree with good topology and bad branch lengths would get dinged for no good reason.
What attributes of the tree do we really care about? To be a good guide tree, it should bring more closely related sequences together before more distantly related sequences. I want a ‘neighborliness’ measure to capture this. Here is one way it could be done. For each node, measure the mean pair-wise identity of sequences in its two subtrees. Then average over all nodes. I claim that higher average = better guide tree.
For algorithm development purposes, it is helpful to know if a sequence is added at a sub-optimal time in the construction of the multiple alignment. This can be done using a similar approach. For each sequence, measure its average pair-wise identity to the opposite subtree at the point where it is added. For example, if an internal node of the guide tree joins (A,B) to (C,D,E), then for sequence A compute the average pair-wise identity of AB, AD and AE (which should be computed independently of the multiple alignment). Now we can compare the quality of two trees T1 and T2 from the point of view of each sequence. Let Q(T1,A) be the average identity I just defined. Then, if Q(T1,A) >> Q(T2,A), we can say that T2 has a bad topological placement of A, and this allows us to look more closely at the tree-building related to sequence A. So a handy report would display the differences |Q(T1,S) – Q(T2,S)| for all sequences S sorted by magnitude. Big differences identify cases where one tree did a bad job of the topology for S.