Multiple protein alignment is a dead field

I believe that traditional multiple protein alignment algorithm development has reached a point of diminishing returns. I regard it as an essentially solved problem for practical purposes, and the marginal progress that could be made is impossible to measure. It is — or should be — a “stagnant field”, to quote a recent critic of one of my papers.

Some history first. CLUSTALW is one of the most highly cited methods in science. The publication of BALIBASE in 1999 triggered a benchmark war, stimulated no doubt by the importance of multiple alignment to a wide range of problems in biology, plus the career advantages if your method got a lot of citations. (I am not excepting myself here: MUSCLE is widely known and has been cited about 3,000 times, which has opened up many opportunities for me.)

BALIBASE was a reasonable first attempt at a benchmark, and while I believe versions 1 and 2 are not as good as many people believed, they are defensible. However, there are several critical questions that are rarely asked or satisfactorily answered. What is the definition of a correct alignment? How exactly were the benchmark alignments made? How do the authors know they are correct? Can we verify that they are correct?

There are different possible definitions of a correct alignment, and they do not always agree. For example, we can require that homologous residues appear in the same column. In general, this is impossible in a multiple alignment because it is an over-simplification of evolutionary history (see Big alignments–do they make sense?). An alternative is to require that structurally equivalent residues are aligned, but structural equivalence gets fuzzy as structures diverge; there is no unique definition of which individual residues are structurally equivalent. This is shown by the fact that different structural alignment algorithms and different human experts will sometimes disagree about residue correspondences. A better way to think about alignments of very distantly related proteins may be to think at the level of secondary structure elements or folds rather than individual residues.

All current benchmarks (BALIBASE, PREFAB, OXBENCH and SABMARK) provide reference alignments and make assessments based on whether individual residues are aligned in agreement with the benchmark. This is reasonable if you take an average over a large number of residues in a large number of alignments, but not if you assume that individual residues in a given reference alignment are all correct. This is a fundamental flaw in some recent assessments that claim to measure specificity (e.g. the FSA paper recently published in PLoS Comp Bio).

In the case of BALIBASE, there are many examples of structures where homology is unclear: the folds are similar, but experts and databases such as SCOP and CATH do not consider the evidence of homology strong enough to place them in the same superfamily. In these cases, it is clearly not possible to make reference alignments that are reliable at the level of individual residues. If you can’t be sure that the folds are homologous, you certainly cannot be sure which residues are homologous: maybe none of them are!
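
To make the residue-level assessment concrete, here is a minimal sketch of a pair-agreement score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. This is an illustration only, not the exact measure defined by any particular benchmark, and the function names (`aligned_pairs`, `pair_agreement`) and the toy sequences are made up for the example. Note that the score treats every pair in the reference as correct, which is exactly the assumption questioned above.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap).
    Each residue is identified as (sequence index, residue index).
    """
    counters = [0] * len(alignment)  # next residue index per sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != '-':
                present.append((s, counters[s]))
                counters[s] += 1
        # Every pair of residues in this column counts as aligned.
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)


# Hypothetical toy data: the test alignment misplaces one gap.
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
score = pair_agreement(test, reference)  # 3 of 4 reference pairs agree
```

A score like this is only as trustworthy as the reference. If some reference pairs are themselves wrong or undefined, a test alignment is penalized for disagreeing with them, which is why averaging over many alignments is more defensible than reading meaning into individual residues.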

BALIBASE v3 is a different case. As I recently showed in my paper “Quality measures for protein alignment benchmarks”, version 3 was mostly aligned by sequence methods, which makes it unsuitable for assessment of multiple alignment programs. Further, it contains egregiously wrong alignments in which non-homologous domains with radically different folds are aligned to each other. The reference alignments in v3 are homologous only in isolated regions surrounded by non-homologous domains. These data grossly violate the assumptions of global alignment methods such as CLUSTALW and MUSCLE that BALIBASE is often used to assess. This makes no sense. Consider, say, mitochondrial chromosomes. These undergo rearrangements, so they are not globally alignable even in closely related species, and it wouldn’t make sense to attempt to align them with CLUSTALW. Similarly, BALIBASE v3 is not globally alignable, so it makes no sense to align it by a global method.

The position of the BALIBASE authors appears to be that (1) fold-level alignments are hard, therefore they should be in a benchmark, and (2) aligning locally homologous regions in proteins that are not globally alignable is hard, so they should be in a benchmark. That is a reasonable argument, but only if your benchmark provides adequate annotation of the data and defines accuracy measures that are informative for this type of data. If you’re doing structure prediction by homology modeling, fold-level alignments make sense, and are assessed for example in the CASP competition. But BALIBASE claims to have correct residue-level alignments and defines correctness as exact agreement with its residue correspondences, which is nonsense when the folds have uncertain homology. I don’t believe it is informative to test whether MUSCLE can correctly align a locally homologous region in a set of proteins that are not globally alignable. To me, that is a misuse of the method. Reasonable people might disagree, but regardless, it is surely important to know that you’re testing a method on data it wasn’t designed to handle.

We should also ask how well benchmark alignments model real biological problems. They typically have a small number of highly diverged sequences, which is not typical of the data sets people align in practice.

Differences in accuracy between the better methods as measured by benchmarks are typically rather small — maybe a few percent at most. These differences are probably much smaller than the uncertainties caused by questionable benchmark alignments and the failure of benchmarks to predict performance on your particular problem, which is probably quite different in several important respects — maybe you have thousands of closely related sequences.

I believe the benchmark wars have been increasingly isolated from practical biological problems over the past few years. Does an improvement of a couple of percent on BALIBASE imply a meaningful improvement in your application? This might be worth investigating, but my guess is no.

I believe the multiple alignment problem is, for all practical purposes, solved. If you’re aligning proteins in the twilight zone, the correct alignment is undefined by structure and unknowable by homology. If identities are higher, then all current methods do pretty well and you might as well use something fast and convenient. I see no reason to invest more effort in trying to improve the benchmark scores of MUSCLE, and I am therefore moving on to other things.

Given the explosion in sequence data, you might argue that creating big alignments is an important problem. But I think this is the wrong approach because you cannot avoid increasing numbers of errors (see Big alignments–do they make sense?), so other approaches are needed. I believe that fast classification methods (search and clustering) are more important, so that’s what I’m working on now with USEARCH and UCLUST.


6 responses to “Multiple protein alignment is a dead field”

  1. Maybe true for proteins. I would like to see you spend a bit more time on optimizing for difficult nucleotide alignments.

  2. I build a lot of nucleotide alignments and (regardless of the alignment program that I use) most of the time is spent hand editing regions that are misaligned. Oftentimes MUSCLE will align a nucleotide region “somewhere” when it should be an insertion. Other times the tree-dependent refinement will pull nearly identical regions of two sequences apart and align one of them somewhere else. Then there are cases where it’s not obvious what went wrong, but the fairly obvious correct alignment (in a region) wasn’t found. Again, I think this may be a problem with the tree-dependent refinement. Maybe it should “remember” which sequences are closer in the tree. Does it weight the global score by closeness in the tree already? I finally figured out the undocumented ‘-termgaps full’ option in 3.7, which caused a lot of grief prior to that. The documented option was a flag, -termgapsfull. In any case, I’m excited to check out the new version and I’m really happy there is a new manual. Thanks much for your great work! I use MUSCLE all the time.

  3. Michael Wise

    Dear Robert,

    Perhaps MSA technology has gone as far as it can with the information it has to hand, but there are still reasonable things users would like MSA systems to do. Two tests from my lecture materials:

    *) Colman et al, J. Virol, 1993, 6:2972-2980. Here they had structures for 5 neuraminidase (NA) sequences from influenza and 10 neuraminidase sequences from combined hemagglutinin (HA)-neuraminidase proteins. They realised that the NA moieties do exactly the same thing, but while the two sets can each be aligned, the entire set couldn’t be (and still can’t be), so they compared the two alignments by eye and pulled out the conserved residues. This eventually resulted in Relenza.

    *) Just about any set of serine proteases will cause MSA systems some grief. I use a set of 8 as a class example: CTRA_BOVIN, PRTA_STRGR, TRY1_BOVIN, TRYP_STRGR, THRB_BOVIN, PLMN_HUMAN, ACH1_LONAC, EL1_BOVIN. While the example is a toy, just the other day I was working on a very similar problem involving a non-standard, papain-like cysteine protease.


    • Good point. I should use a qualifier like “general-purpose”. For any given protein family, you can probably do better by exploiting family-specific knowledge.

  4. Christophe Lambert

    Dear Robert,

    I did a PhD thesis 10 years ago, partially on MSA. One chapter of the document was on “How to assess MSA”. It is clear there is no way to prove an alignment is the true alignment. I proposed to say that “reference alignments” are generally a good approximation of the truth (you cited the many reasons why they are not necessarily the truth). Furthermore, the “truth” depends on what you want to model by doing MSA. Is it evolution, structure conservation (1D, 2D or 3D), site conservation, function, domain, local or global, and probably other things. It is really difficult to distinguish the best current methods for aligning based on benchmarks.
    You are right in considering multiple sequence alignment a dead field if you consider the accuracy of “ab initio” alignments. However, a lot of specifically designed algorithms can be used to best align one family or another. Human experts knowing the biology behind the sequences are still needed …
    Another way to improve alignment was to improve the speed. I proposed using preprocessing and hash search, which is now done in several programs. The most common use is to assemble reads from genome sequencing.
    I don’t see big improvements coming in the MSA field, in quality or in speed.
    The asymptotic limit has been reached for MSA as well as for secondary structure prediction, and most probably for 3D structure also.

    Finally, with all the high-throughput methods we are in the MSA era. The vast majority of bioinformatics methods are based on alignments.

    You are doing a very good job with MUSCLE and UCLUST/USEARCH.

    Best Regards,
