I believe that traditional multiple protein alignment algorithm development has reached a point of diminishing returns. I regard it as an essentially solved problem for practical purposes, and the marginal progress that could be made is impossible to measure. It is — or should be — a “stagnant field”, to quote a recent critic of one of my papers.
Some history first. CLUSTALW is one of the most highly cited methods in science. The publication of BALIBASE in 1999 triggered a benchmark war, stimulated no doubt by the importance of multiple alignment to a wide range of problems in biology, plus the career advantages if your method got a lot of citations. (I am not excepting myself here — MUSCLE is widely known and has been cited about 3,000 times, which has opened up many opportunities for me).
BALIBASE was a reasonable first attempt at a benchmark, and while I believe versions 1 and 2 are not as good as many people believed, they are defensible. However, there are several critical questions that are rarely asked or satisfactorily answered. What is the definition of a correct alignment? How exactly were the benchmark alignments made? How do the authors know they are correct? Can we verify that they are correct?
There are different possible definitions of a correct alignment, and they do not always agree. For example, we can require that homologous residues appear in the same column. In general, this is impossible in a multiple alignment because it is an over-simplification of evolutionary history (see Big alignments–do they make sense?). An alternative is to require that structurally equivalent residues are aligned, but structural equivalence gets fuzzy as structures diverge; there is no unique definition of which individual residues are structurally equivalent. This is shown by the fact that different structural alignment algorithms, and different human experts, will sometimes disagree about residue correspondences. A better way to think about alignments of very distantly related proteins may be to think at the level of secondary structure elements or folds rather than individual residues.

All current benchmarks (BALIBASE, PREFAB, OXBENCH and SABMARK) provide reference alignments and assess test alignments by whether individual residues are aligned in agreement with the benchmark. This is reasonable if you take an average over a large number of residues in a large number of alignments, but not if you assume that individual residues in a given reference alignment are all correct. This is a fundamental flaw in some recent assessments that claim to measure specificity (e.g. the FSA paper recently published in PLoS Comp Bio). In the case of BALIBASE, there are many examples of structures where homology is unclear: the folds are similar, but experts and databases such as SCOP and CATH do not consider the evidence of homology strong enough to place them in the same superfamily. In these cases, it is clearly not possible to make reference alignments that are reliable at the level of individual residues. If you can’t be sure that the folds are homologous, you certainly cannot be sure which residues are homologous; maybe none of them are!
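To make this kind of assessment concrete, here is a minimal sketch of the residue-pair agreement score that benchmarks of this type compute: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sequence names and rows in the example are invented for illustration; real benchmark tooling handles core/non-core column annotations and other details omitted here.

```python
# Sketch of a reference-based residue-pair agreement score (Q/SP-style).
# An alignment is a dict mapping sequence name -> aligned (gapped) row.

from itertools import combinations

def residue_pairs(alignment):
    """Return the set of aligned residue pairs ((seq1, i), (seq2, j)),
    where i and j are ungapped residue indices placed in the same column."""
    names = sorted(alignment)  # consistent pair ordering across alignments
    # For each sequence, map column -> residue index (None for a gap).
    cols = {}
    for name in names:
        idx = 0
        cols[name] = []
        for c in alignment[name]:
            if c == '-':
                cols[name].append(None)
            else:
                cols[name].append(idx)
                idx += 1
    ncols = len(next(iter(alignment.values())))
    pairs = set()
    for a, b in combinations(names, 2):
        for col in range(ncols):
            i, j = cols[a][col], cols[b][col]
            if i is not None and j is not None:
                pairs.add(((a, i), (b, j)))
    return pairs

def q_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref = residue_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & residue_pairs(test)) / len(ref)

# A perfectly reproduced alignment scores 1.0; a shifted one can score 0.0.
ref = {"A": "MKVL", "B": "MKVL"}
print(q_score({"A": "MKVL", "B": "MKVL"}, ref))    # 1.0
print(q_score({"A": "MKVL-", "B": "-MKVL"}, ref))  # 0.0
```

Note that this score treats every reference residue pair as equally trustworthy, which is exactly the assumption questioned above: averaged over many alignments the score is informative, but for an individual pair of residues in the twilight zone the reference itself may be wrong.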
BALIBASE v3 is a different case. As I recently showed in my paper “Quality measures for protein alignment benchmarks”, version 3 was mostly aligned by sequence methods, which makes it unsuitable for assessing multiple alignment programs. Further, it contains egregiously wrong alignments in which non-homologous domains with radically different folds are aligned to each other. The reference alignments in v3 are homologous only in isolated regions surrounded by non-homologous domains. This data grossly violates the assumptions of global alignment methods like CLUSTALW and MUSCLE, which BALIBASE is often used to assess. This makes no sense. Consider, say, mitochondrial chromosomes. These undergo rearrangements, so they are not globally alignable even in closely related species, and it would make no sense to attempt to align them with CLUSTALW. Similarly, BALIBASE v3 is not globally alignable, so it makes no sense to align it with a global method.

The position of the BALIBASE authors appears to be (1) fold-level alignments are hard, therefore they should be in a benchmark, and (2) aligning locally homologous regions in proteins that are not globally alignable is hard, so they should be in a benchmark. That is a reasonable argument, but only if your benchmark provides adequate annotation of the data and defines accuracy measures that are informative for this type of data. If you’re doing structure prediction by homology modeling, fold-level alignments make sense, and are assessed for example in the CASP competition. But BALIBASE claims to have correct residue-level alignments and defines correctness as exact agreement with its residue correspondences, which is nonsense when the folds have uncertain homology. I don’t believe it is informative to test whether MUSCLE can correctly align a locally homologous region in a set of proteins that are not globally alignable. To me, that is a misuse of the method.
Reasonable people might disagree, but regardless it is surely important to know that you’re testing a method on data it wasn’t designed to handle.
We should also ask how well benchmark alignments model real biological problems. Benchmarks typically contain a small number of highly diverged sequences, which is not representative of most practical alignment tasks.
Differences in accuracy between the better methods as measured by benchmarks are typically rather small, maybe a few percent at most. These differences are probably much smaller than the uncertainties caused by questionable benchmark alignments and by the failure of benchmarks to predict performance on your particular problem, which is probably quite different in several important respects; maybe you have thousands of closely related sequences.
I believe the benchmark wars have become increasingly isolated from practical biological problems over the past few years. Does an improvement of a couple of percent on BALIBASE imply a meaningful improvement in your application? This might be worth investigating, but my guess is no.
I believe the multiple alignment problem is, for all practical purposes, solved. If you’re aligning proteins in the twilight zone, the correct alignment is undefined by structure and unknowable by homology. If identities are higher, then all current methods do pretty well and you might as well use something fast and convenient. I see no reason to invest more effort in trying to improve the benchmark scores of MUSCLE, and I am therefore moving on to other things.
Given the explosion in sequence data, you might argue that creating big alignments is an important problem. But I think this is the wrong approach because you cannot avoid increasing numbers of errors (see Big alignments–do they make sense?), so other approaches are needed. I believe that fast classification methods (search and clustering) are more important, so that’s what I’m working on now with USEARCH and UCLUST.