(Update 20 July 2010): The latest version of USEARCH now supports building very large alignments, so you can try this yourself and the invitation has expired!
I’ve made some good progress on developing algorithms for aligning very large sets of sequences, so I’m inviting you to send me your large sets. I’d like to try the method on more real data. I’ll feed them to the algorithm and send you the alignment.
If you do take me up on this tempting free offer, then please let me know some background on the sequences and what you plan to do with a multiple alignment. I’m generally skeptical about making large alignments, so I’d like to understand better what people out there are trying to do. Also, please see the comments on outlier sequences at the end of this post.
So how does it work? My approach has been to explore ways to exploit the extremely fast search and clustering speeds of USEARCH and UCLUST to help guide a multiple alignment. I started by trying to build a single nearest-neighbor guide tree for the entire set by using USEARCH to find neighbors. This ran into two major difficulties: (1) building a good tree, and (2) growth in alignment length.
1. Building a good tree: It turned out that using USEARCH to find neighbors worked well close to the leaves of the tree, but not for identifying neighbors of larger groups. In retrospect, this is easy to understand because USEARCH (in its current form, at least) is a sequence-sequence search method, and what is really needed is a profile-profile search method.
2. Growth in alignment length: All methods capable of building large alignments are based on the “progressive” strategy, meaning that two profiles are aligned at each internal node of a guide tree. New gaps are introduced each time and every gap adds a new column to the multiple alignment. So the alignment length (=number of columns) usually increases in each iteration, and the number of added columns in the final alignment is roughly a constant x (number of nodes in the tree) = constant x (number of sequences). So with thousands of sequences, you get generally get thousands of columns, most of which are very gappy. Algorithms are sometimes criticized for doing this, but the correct biological alignment should often be expected to be very gappy, assuming that residue homology is your standard of correctness (i.e., two residues should be in the same column iff they are homologous). In variable regions, repeated cycles of insertions, deletions and inversions will tend to produce residues that are not homlogous. This reflects the limitations of the multiple alignment format, in particular its inability to correctly represent evolutionary history. This doesn’t mean that the gaps in an automated alignment are necessarily correct (probably many if not most are wrong), but it is not reasonable to critique, say, a large MUSCLE alignment just because it is very gappy.
A different approach mitigates both these issues. The idea is to split the problem into pieces that are small enough to align by conventional methods (e.g., MUSCLE). Suppose you have 100,000 sequences, but MUSCLE can align only 5,000 sequences. The solution is to divide the 100k sequences into 20 or so subsets of size <= 5k and align each of these. Then pick one representative sequence from each of the 20 subsets and align these to create a “master” alignment. The subset alignments can then be combined into a final alignment by using the master alignment as a guide. This approach could be used to align up to about 5,000^2 = 25M sequences.
We want to align closely related sequences whenever possible, since they can be aligned more accurately (this of course is the principle underlying progressive alignment). So we want to split sequences into subsets such that sequences within a subset are more similar to each other than to members of other subsets. In other words, we have a clustering problem. The simplest and fastest clustering method would be to find a ULCUST threshold such that the maximum cluster size is 5k. This usually won’t work well because often the cluster size varies widely, so if the maximum is set to 5k, there will be many small clusters. A better approach is to perform hierarchical clustering and cut the resulting tree at branches that maximize the subtree size at a value <= 5k. This works as follows: UCLUST is repeatedly applied using a predetermined set of decreasing identity thresholds, say 99%, 98%, 95%, 90%, 80%, 50% and 35%. At each stage, the representative sequences (“seeds”) from each cluster are passed to the next stage. This builds a multi-furcating tree in which each node is a seed sequence from one of these stages, edges from a node point to the members of its cluster. A top-down traversal of the tree can then be used to cut it into pieces of the desired size.
Outlier sequences can mess up a large alignment: every time an outlier is added, the number of errors will tend to increase disproportionately. In large sets, outliers are often caused by poor-quality or anomalous sequences that don’t really belong anyway. A quick and easy way to look for them is to run UCLUST at a relatively low identity, say 80% for nucleotides or 50% for proteins. Typically, you’ll find a few large clusters and many smaller clusters. The small clusters are the outliers: it’s worth trying to build a multiple alignment without sequences in the smaller clusters.