Blast! I’m re-inventing the wheel

Recently I added a local search option (UBLAST) to USEARCH. This means that I’m having to reinvent stuff that was cutting-edge research in the late 1980s and early 1990s. Comments and suggestions are solicited from experts in this area.

The basic idea in USEARCH is to use an index to quickly find a few candidate top hits from the database, which are then examined more carefully. Candidates are ranked by word count, which correlates well enough with sequence identity that it can pick out (say) a few hundred candidates from a few million proteins and have pretty decent sensitivity into the twilight zone. With an appropriate definition of sensitivity, that is — I’m not trying to find all homologs, just the closest.
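To make the ranking idea concrete, here is a toy sketch (not the USEARCH index code) of picking candidates by shared word count. In a real implementation the database words are indexed once up front rather than recomputed per query, and the function and parameter names here are just for illustration.

```python
from collections import Counter

def words(seq, k):
    """The set of k-mers (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def top_candidates(query, db, k=3, ncand=100):
    """Rank database sequences by the number of words shared with the query
    and return the labels of the top candidates for closer examination.
    db is a dict of {label: sequence}."""
    qwords = words(query, k)
    counts = Counter({label: len(qwords & words(seq, k)) for label, seq in db.items()})
    return [label for label, _ in counts.most_common(ncand)]
```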

I need to implement my own BLAST-like algorithm because the selected subset of the database changes with each query, so existing implementations can’t be used. The current UBLAST is a quick-and-dirty gapless seed-and-extend algorithm using exact 3-mer seeds for amino acids and 5-mers for nucleotides. I used BLOSUM62 with vanilla Karlin-Altschul e-values (no edge effect corrections etc.). This was envisaged as a baseline to work from, but I did a PDB40 test and to my surprise it had better sensitivity than ungapped NCBI BLAST, so I shipped the prototype.
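For readers unfamiliar with the approach, here is a toy sketch of gapless seed-and-extend: exact k-mer seeds are located and extended in both directions without gaps, giving up when the running score drops too far below the best seen so far (the x-drop rule). This is an illustration of the idea, not the UBLAST code; score(a, b) stands in for a substitution matrix such as BLOSUM62.

```python
def extend_one_way(q, t, i, j, step, score, xdrop):
    """Best cumulative score gain extending from (i, j) in direction step (+1 or -1)."""
    best = run = 0.0
    while 0 <= i < len(q) and 0 <= j < len(t):
        run += score(q[i], t[j])
        if run > best:
            best = run
        elif best - run > xdrop:      # fallen too far below the best: stop extending
            break
        i += step
        j += step
    return best

def seed_and_extend(q, t, k, score, xdrop):
    """Best ungapped HSP score over all exact k-mer matches between q and t."""
    index = {}                        # word -> positions in the target
    for j in range(len(t) - k + 1):
        index.setdefault(t[j:j + k], []).append(j)
    best = 0.0
    for i in range(len(q) - k + 1):
        for j in index.get(q[i:i + k], []):
            seed = sum(score(q[i + d], t[j + d]) for d in range(k))
            hsp = (seed
                   + extend_one_way(q, t, i + k, j + k, +1, score, xdrop)
                   + extend_one_way(q, t, i - 1, j - 1, -1, score, xdrop))
            best = max(best, hsp)
    return best
```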

To improve on this, I need to design and tune a number of implementation details. These have been well-studied, but I have some questions that the literature doesn’t seem to answer.

I’ve found that floating-point arithmetic is comparable in speed to integer on modern processors, so these days I write all my d.p. code in floating point since this is more flexible and avoids issues of over- and underflow. I noticed that using a full-precision VTML220 matrix in UBLAST gave about a 4% sensitivity improvement at 1% EPQ over the usual integer-rounded BLOSUM62, which could easily be spurious but seemed worth pursuing. A significant improvement is plausible considering that the E-value has an exponential dependence on the score, so small corrections to the score can have a large effect on the E-value. This also applies to gap penalties — the extension penalty is often 1, when perhaps 0.3 or 1.3 might be better and give a further boost to sensitivity. The Brenner et al. result [doi:10.1093/bioinformatics/bti627] that modern matrices are no better than BLOSUM could be an artifact of integer rounding.
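To make the exponential dependence concrete: for ungapped local alignments the Karlin-Altschul E-value is E = K m n exp(-lambda S), so a shift of dS in the raw score rescales E by exp(lambda dS). The snippet below is only an illustration; lambda and K are rough placeholder values of the right order for ungapped BLOSUM62, not fitted parameters.

```python
import math

def evalue(S, m, n, lam, K):
    """Karlin-Altschul E-value for an ungapped raw score S, query length m,
    database length n."""
    return K * m * n * math.exp(-lam * S)

lam, K = 0.32, 0.13        # rough ungapped BLOSUM62-like values (placeholders)
m, n = 350, 350            # query and subject lengths
for S in (38.0, 38.5, 39.0):
    print(S, evalue(S, m, n, lam, K))
# Each +0.5 in raw score shrinks E by exp(0.32 * 0.5), about 1.17x. Rounding a
# matrix to integers perturbs each pair score by up to 0.5, and those
# perturbations accumulate over the length of the alignment.
```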

This raises the problem of untangling the contributions of the many parameters in the algorithm: matrix, gap penalties, x-drop, T (threshold score for a seed), and K-A parameters and corrections such as edge effects. Here is my current plan for attacking this. I will write an ungapped Smith-Waterman engine with floating-point d.p. and use it to score all pairs in PDB40. This isolates the matrix, since there are no other variables. I can then try different methods for computing E-values from these scores, e.g. to measure the effectiveness of edge corrections or composition-based statistics. If this gives consistent results for different matrices, that identifies the most effective method for E-value calculation and shows whether I can neglect any of the corrections. Or, if the results are not consistent, it tells me that E-value calculation is a black art and I need to fall back to trial and error with each variant of the algorithm. (I can’t do a curve-fit like SSEARCH because the subset of scored database sequences is small and biased when the algorithm is embedded in USEARCH). Then I can ask whether the sensitivity improvement for floating point still holds. If it does, that motivates the effort to check whether it holds up in the gapped case, where it might give a worthwhile improvement over BLASTP.
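Here is a minimal sketch of the kind of engine I have in mind: ungapped Smith-Waterman (the best-scoring local diagonal run) with floating-point scores, so a full-precision matrix can be used without rounding. This is a sketch of the plan, not shipped code; score(a, b) is assumed to return a float.

```python
def ungapped_sw(q, t, score):
    """Best local ungapped alignment score between sequences q and t,
    computed with floating-point dynamic programming."""
    best = 0.0
    prev = [0.0] * (len(t) + 1)             # one previous DP row is enough
    for i in range(1, len(q) + 1):
        cur = [0.0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            h = prev[j - 1] + score(q[i - 1], t[j - 1])
            cur[j] = h if h > 0.0 else 0.0  # local alignment: never go below zero
            if cur[j] > best:
                best = cur[j]
        prev = cur
    return best
```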

UCLUST v2.1 now available

Today I posted uclust v2.1. For information on new features and fixes, see this page. It includes a new clustering method I call clumping: clustering with the goal of identifying clusters (clumps) of a pre-determined size. Members of a given clump should be more similar to each other than to members of other clumps. The motivation for clumping is to divide a set of sequences into pieces that are small enough for a given method to handle — say, multiple alignment or phylogenetic tree estimation.

Does anyone out there know of a term for this type of clustering? Please let me know!

Clumping is the basis of my new method for building very large alignments with MUSCLE. You can try this yourself now if you have both MUSCLE and UCLUST; it is described in the new UCLUST manual.

There is also a chimera detection algorithm (UCHIME) that can find chimeras de novo in a set of unclassified and unaligned reads. This is a work in progress, but is working pretty well so I thought I’d make it available.

Send me your big sequence sets!

(Update 20 July 2010): The latest version of USEARCH now supports building very large alignments, so you can try this yourself and the invitation has expired!

I’ve made some good progress on developing algorithms for aligning very large sets of sequences, so I’m inviting you to send me your large sets. I’d like to try the method on more real data. I’ll feed them to the algorithm and send you the alignment.

If you do take me up on this tempting free offer, then please let me know some background on the sequences and what you plan to do with a multiple alignment. I’m generally skeptical about making large alignments, so I’d like to understand better what people out there are trying to do. Also, please see the comments on outlier sequences at the end of this post.

So how does it work? My approach has been to explore ways to exploit the extremely fast search and clustering speeds of USEARCH and UCLUST to help guide a multiple alignment. I started by trying to build a single nearest-neighbor guide tree for the entire set by using USEARCH to find neighbors. This ran into two major difficulties: (1) building a good tree, and (2) growth in alignment length.

1. Building a good tree: It turned out that using USEARCH to find neighbors worked well close to the leaves of the tree, but not for identifying neighbors of larger groups. In retrospect, this is easy to understand because USEARCH (in its current form, at least) is a sequence-sequence search method, and what is really needed is a profile-profile search method.

2. Growth in alignment length: All methods capable of building large alignments are based on the “progressive” strategy, meaning that two profiles are aligned at each internal node of a guide tree. New gaps are introduced each time and every gap adds a new column to the multiple alignment. So the alignment length (=number of columns) usually increases in each iteration, and the number of added columns in the final alignment is roughly a constant x (number of nodes in the tree) = constant x (number of sequences). So with thousands of sequences, you generally get thousands of columns, most of which are very gappy. Algorithms are sometimes criticized for doing this, but the correct biological alignment should often be expected to be very gappy, assuming that residue homology is your standard of correctness (i.e., two residues should be in the same column iff they are homologous). In variable regions, repeated cycles of insertions, deletions and inversions will tend to produce residues that are not homologous. This reflects the limitations of the multiple alignment format, in particular its inability to correctly represent evolutionary history. This doesn’t mean that the gaps in an automated alignment are necessarily correct (probably many if not most are wrong), but it is not reasonable to critique, say, a large MUSCLE alignment just because it is very gappy.

A different approach mitigates both these issues. The idea is to split the problem into pieces that are small enough to align by conventional methods (e.g., MUSCLE). Suppose you have 100,000 sequences, but MUSCLE can align only 5,000 sequences. The solution is to divide the 100k sequences into 20 or so subsets of size <= 5k and align each of these. Then pick one representative sequence from each of the 20 subsets and align these to create a “master” alignment. The subset alignments can then be combined into a final alignment by using the master alignment as a guide. This approach could be used to align up to about 5,000^2 = 25M sequences.
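To illustrate the final stitching step, here is a sketch of combining subset alignments using the master alignment as a guide; it is my own illustration of the idea, not the code I actually use. Each subset alignment is padded with gap columns so that its representative’s residues land in the columns dictated by the master. It assumes each subset alignment lists its representative’s row first and that the representative carries the same ungapped sequence in both alignments; inserted columns are simply left-padded with gaps.

```python
def blocks(rep_sub, rep_master):
    """Partition the subset alignment's columns into one block per master
    column (plus a trailing block). Block c ends at the subset column holding
    the representative residue that sits in master column c; it is empty if
    the representative has a gap there."""
    letter_cols = [i for i, ch in enumerate(rep_sub) if ch != '-']
    out, start, k = [], 0, 0
    for ch in rep_master:
        if ch == '-':
            out.append((start, start))               # empty block
        else:
            end = letter_cols[k] + 1
            out.append((start, end))
            start, k = end, k + 1
    out.append((start, len(rep_sub)))                # trailing columns, if any
    return out

def merge(master_rows, subset_alignments):
    """master_rows[i] is the master alignment row for the representative of
    subset_alignments[i]; each subset alignment is a list of equal-length rows
    with the representative first. Returns the combined rows."""
    all_blocks = [blocks(rows[0], master_rows[i])
                  for i, rows in enumerate(subset_alignments)]
    ncols = len(master_rows[0])
    widths = [max(b[c][1] - b[c][0] for b in all_blocks) for c in range(ncols + 1)]
    final = []
    for i, rows in enumerate(subset_alignments):
        padded = [''] * len(rows)
        for c in range(ncols + 1):
            s, e = all_blocks[i][c]
            for r, row in enumerate(rows):
                piece = row[s:e]
                padded[r] += '-' * (widths[c] - len(piece)) + piece
        final.extend(padded)
    return final
```

Because every representative residue ends up in the last column of its block, the residue correspondences asserted by the master alignment are preserved in the combined alignment.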

We want to align closely related sequences whenever possible, since they can be aligned more accurately (this of course is the principle underlying progressive alignment). So we want to split sequences into subsets such that sequences within a subset are more similar to each other than to members of other subsets. In other words, we have a clustering problem. The simplest and fastest clustering method would be to find a UCLUST threshold such that the maximum cluster size is 5k. This usually won’t work well because the cluster sizes often vary widely, so if the maximum is set to 5k, there will be many small clusters. A better approach is to perform hierarchical clustering and cut the resulting tree at branches that maximize the subtree size at a value <= 5k. This works as follows: UCLUST is repeatedly applied using a predetermined set of decreasing identity thresholds, say 99%, 98%, 95%, 90%, 80%, 50% and 35%. At each stage, the representative sequences (“seeds”) from each cluster are passed to the next stage. This builds a multi-furcating tree in which each node is a seed sequence from one of these stages, and the edges from a node point to the members of its cluster. A top-down traversal of the tree can then be used to cut it into pieces of the desired size.
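The top-down cut is straightforward once the tiered clustering has been recorded as a tree; here is a sketch of just that step. The child map and node keys below are illustrative, not UCLUST output fields, and nodes are assumed to be keyed uniquely (say by a (stage, seed) pair) so that a sequence which seeds several stages appears as distinct nodes.

```python
def subtree_sizes(roots, children):
    """Number of original sequences (leaves) under every node of the
    multi-furcating cluster tree. children[node] lists the nodes clustered
    under that seed at the next stage; leaves have no entry."""
    sizes = {}
    def walk(node):
        kids = children.get(node, [])
        sizes[node] = 1 if not kids else sum(walk(k) for k in kids)
        return sizes[node]
    for r in roots:
        walk(r)
    return sizes

def cut(roots, children, max_size):
    """Top-down traversal that emits the largest subtrees with at most
    max_size leaves; each emitted subtree becomes one alignment job."""
    sizes = subtree_sizes(roots, children)
    pieces, stack = [], list(roots)
    while stack:
        node = stack.pop()
        if sizes[node] <= max_size or not children.get(node):
            pieces.append(node)
        else:
            stack.extend(children[node])     # too big: descend and cut lower down
    return pieces
```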

Outlier sequences can mess up a large alignment: every time an outlier is added, the number of errors will tend to increase disproportionately. In large sets, outliers are often caused by poor-quality or anomalous sequences that don’t really belong anyway. A quick and easy way to look for them is to run UCLUST at a relatively low identity, say 80% for nucleotides or 50% for proteins. Typically, you’ll find a few large clusters and many smaller clusters. The small clusters are the outliers: it’s worth trying to build a multiple alignment without sequences in the smaller clusters.

Measuring guide tree quality

I’m working on building trees from large sequence sets. To start with, I want to make guide trees for MUSCLE; long-term, I want to make estimates of phylogeny. One approach is to use USEARCH as a subroutine that finds the nearest neighbors of a sequence or a group of sequences.

Short-term, I want to make guide trees for MUSCLE for large sequence sets. Building the guide tree is currently the most expensive step in both time and memory, so building a tree more efficiently will allow much larger sets to be aligned.

I’m tackling this first because (i) MUSCLE doesn’t care (much) about branch lengths, so this simplifies the problem, and (ii) I often get requests from people wanting to align large numbers of sequences (usually not recommended; see this post).

I’ve implemented an algorithm I call “middle-out” hierarchical clustering (as opposed to top-down or bottom-up). It works like this. Set a target number of sequences that is tractable for all-vs-all distance clustering by UPGMA, say 1000. Start by using UCLUST at some medium identity, say 80% for nucleotides. If there are too many seeds, repeatedly cluster the seeds at lower identities until their number comes down to the target. Then, for each cluster, repeatedly cluster its members at higher identities until the pieces come down to the target size. Conceptually, this builds the middle of the tree; the top and bottom levels are done by a more careful method.
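Here is a sketch of the middle-out driver as described above, not the UCLUST implementation itself. cluster_fn(seqs, pct_id) stands in for a call to UCLUST and is assumed to return {seed: members} with each seed included among its own members; the identity ladders are illustrative.

```python
def middle_out(seqs, cluster_fn, target=1000,
               down=(0.80, 0.70, 0.60, 0.50, 0.35),
               up=(0.90, 0.95, 0.98, 0.99)):
    """Split seqs into groups small enough for all-vs-all UPGMA."""
    # Work down from a medium identity until the number of top-level seeds
    # is tractable.
    level = 0
    clusters = cluster_fn(seqs, down[level])
    while len(clusters) > target and level + 1 < len(down):
        level += 1
        merged = cluster_fn(list(clusters.keys()), down[level])
        # re-attach the members of absorbed seeds to their new top-level seed
        clusters = {seed: [m for old in kids for m in clusters[old]]
                    for seed, kids in merged.items()}
    # Within each top-level cluster, work up to higher identities until the
    # pieces are small enough.
    return {seed: split_up(members, cluster_fn, target, up)
            for seed, members in clusters.items()}

def split_up(members, cluster_fn, target, up, level=0):
    """Recursively split an oversized cluster at increasing identities."""
    if len(members) <= target or level >= len(up):
        return [members]
    pieces = []
    for sub in cluster_fn(members, up[level]).values():
        pieces.extend(split_up(sub, cluster_fn, target, up, level + 1))
    return pieces
```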

This builds a tree, but now the question is how good is it? One approach is to compare with a phylogenetic tree built by a more robust method, say FastTree. This has a couple of problems. In general, the optimal guide tree is not the same as the phylogenetic tree (see the second MUSCLE paper for the reasoning). Also, the metrics used for tree comparison are usually topological, i.e. consider only the branching order, ignoring branch lengths. At first glance, this is good because MUSCLE doesn’t care about branch lengths, but…

Topological metrics like Robinson-Foulds don’t distinguish cases where branching order is important from cases where it is not. Suppose, for example, there are three identical sequences. Their branching order is then arbitrary, and R-F can ding you for no good reason.

This could be fixed by using a metric that is aware of branch lengths. But MUSCLE only uses the branching order, so a tree with good topology and bad branch lengths would get dinged for no good reason.

What attributes of the tree do we really care about? To be a good guide tree, it should bring more closely related sequences together before more distantly related sequences. I want a ‘neighborliness’ measure to capture this. Here is one way it could be done. For each internal node, measure the mean pair-wise identity between the sequences in one of its subtrees and the sequences in the other. Then average over all nodes. I claim that higher average = better guide tree.

For algorithm development purposes, it is helpful to know if a sequence is added at a sub-optimal time in the construction of the multiple alignment. This can be done using a similar approach. For each sequence, measure its average pair-wise identity to the opposite subtree at the point where it is added. For example, if an internal node of the guide tree joins (A,B) to (C,D,E), then for sequence A compute the average pair-wise identity of AC, AD and AE (which should be computed independently of the multiple alignment). Now we can compare the quality of two trees T1 and T2 from the point of view of each sequence. Let Q(T1,A) be the average identity I just defined. Then, if Q(T1,A) >> Q(T2,A), we can say that T2 has a bad topological placement of A, and this allows us to look more closely at the tree-building related to sequence A. So a handy report would display the differences |Q(T1,S) – Q(T2,S)| for all sequences S, sorted by magnitude. Big differences identify cases where one tree did a bad job of the topology for S.
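Here is a sketch of both measures, assuming the guide tree is given as nested tuples of sequence labels, e.g. (("A","B"),("C",("D","E"))), and that pct_id(a, b) returns a pairwise identity computed independently of any multiple alignment. The function names are mine, not anything from MUSCLE; exactly which join counts as the point where a sequence is added depends on the progressive order, so that choice is left to the caller.

```python
from itertools import product

def leaves(t):
    """All sequence labels under a node of a nested-tuple tree."""
    return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

def neighborliness(tree, pct_id):
    """Mean over internal nodes of the mean cross-subtree pairwise identity;
    higher should indicate a better guide tree."""
    node_means = []
    def walk(t):
        if isinstance(t, tuple):
            left, right = leaves(t[0]), leaves(t[1])
            ids = [pct_id(a, b) for a, b in product(left, right)]
            node_means.append(sum(ids) / len(ids))
            walk(t[0])
            walk(t[1])
    walk(tree)
    return sum(node_means) / len(node_means)

def per_sequence_identity(node, pct_id):
    """For one internal node (left, right), the mean identity of each leaf to
    the opposite subtree. Q(T,S) reads this value off at the node where S is
    considered to be added; sorting |Q(T1,S) - Q(T2,S)| by magnitude gives the
    report described above."""
    left, right = leaves(node[0]), leaves(node[1])
    q = {s: sum(pct_id(s, x) for x in right) / len(right) for s in left}
    q.update({s: sum(pct_id(s, x) for x in left) / len(left) for s in right})
    return q
```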

Fishing for significance

Recently I read a blog post that pointed me at a very interesting paper in Bioinformatics called “Over-optimism in bioinformatics research”. Anne-Laure Boulesteix describes a pervasive problem called “fishing for significance”. I quickly realized that (1) she’s right, and (2) mea culpa — I have been fishing for years.

Here’s what happens. You’re developing a new algorithm, say a multiple sequence alignment algorithm, or you’re trying ideas for improvements (new version of MUSCLE). Would neighbor-joining or UPGMA work better for the guide tree? Would 3-mers or 4-mers work better for fast tree building? Does sequence weighting help or hurt [I think it actually hurts, but that’s a story for another post]? To answer these questions, you write some code to implement each idea and run the program on a standard benchmark, which would be (say) BALIBASE for protein alignments. You keep the ideas that “work” (meaning, get a better score on BALIBASE), and throw the rest away.

This was exactly the strategy I used in developing MUSCLE, and I believe it is generally how algorithms are developed in this field. If there is no standard benchmark, you develop your own, and then the same applies. Sometimes people describe this process (I did in my second MUSCLE paper, and the Opal paper also makes it clear), but more often only the final design is presented.

This strategy results in a hidden form of over-tuning to the benchmark. If you’ve done this, then it’s too late to do fancy statistical footwork like splitting the data into testing and training sets for parameter tuning; you’ve already over-tuned a bunch of boolean yes/no parameters that select elements of the algorithm. Ideally, this would be solved by blind testing as done in the CASP competitions. But in most areas of computational biology you can’t do this, and it’s hard to see a better way to approach algorithm development. I think the main lesson here is that we should be more skeptical of published validation and benchmark results: very often improvements in the state of the art and claims of statistical significance are greatly exaggerated. This is certainly true in multiple alignment, where the latest BALIBASE is disastrously bad and other benchmarks are poor models of alignment problems encountered in practice.

An unemployed gentleman scholar

A comment to a previous post asks me for some personal information: “I’ve noticed that you never list a university, firm, or non-profit affiliation on your papers or website. Would you mind writing a post about how you got to be where you are, who supports your work, the reactions of reviewers to papers from outside the university/well-known industrial research lab circle, and the like? I for one would be terribly interested in your personal ascent to greatness outside the establishment.” Well, I’m as susceptible to flattery as the next guy, so here goes. First off I should say that my career path was not planned or plannable, so you can’t follow in my footsteps except to go where there is no path.

From a young age (7 years old, Luna 9 moon landing pictures 1966), my vocation was to be a scientist, and I quickly settled on theoretical physics because physics is everything — any explanation of the natural world eventually bottoms out in the laws of physics (yeah, I’m a hard-core reductionist and proud of it). At high school and college, I was pretty good at math and physics, but once I started my PhD I got a shock because I discovered that many people were much smarter than me. A few months after starting my postdoc I concluded that I would always be a mediocre physicist and would never make much of a contribution, so I quit. That was very difficult because physics had been my life’s dream, but in retrospect physics was the wrong field for me; I’m more of a natural programmer than a natural mathematician. Like many failed physicists in the mid-80s, I got into the software business and ended up starting my own company in San Francisco. I sold the business to Intel in 1999, and in 2001 I was burned out and quit with no idea what to do next.

I didn’t want to start another business, which is what most entrepreneurs do, and I didn’t want to retire and play golf (or more likely tennis). I knew about the human genome project and the importance of software algorithms — it was very appealing to me as a software guy that you could do important things just with strings of letters without knowing all that tedious biochemistry. Biology=strcmp(). So while it didn’t occur to me that I might actively work on that stuff, I thought it would be fun to learn something about it, and crashed a seminar at UC Berkeley and met a newly-hired professor named Kimmen Sjolander. She was looking for students to help her code up some algorithms for summer research experience, and I volunteered. The result was a multiple alignment program called SATCHMO, which worked pretty well but was no better than CLUSTALW according to BALIBASE, which was the only available benchmark at that time (2003).

Being a competitive guy, I was dissatisfied with that, and started a systematic project to figure out what did and didn’t work in multiple alignment algorithms, because it seemed to me that the programs had many ideas in them, but it wasn’t clear which of the ideas helped or hurt the results. At that time, Gotoh’s PRRx had the best accuracy (except maybe for T-Coffee; I don’t remember), but the code was unfriendly and slow and hardly anybody used it. So I started from the PRRx algorithm and systematically varied the elements of the algorithm with all the alternatives I could find in the literature or dream up myself. That was the line of work that led to MUSCLE. (As an aside, the MAFFT people followed a similar strategy independently, and it’s remarkable how much the original MAFFT and MUSCLE v1 resembled each other. MAFFT got published first and was better than my first attempt; I had to ‘borrow’ a couple of their ideas in order to do better than MAFFT. For example, we both had k-mer counting to build the first tree, but I was using 3-mers in the usual alphabet and they used 6-mers in a compressed alphabet, which gave slightly better results. A nice idea on their part, but in retrospect I shouldn’t have copied it because it is a BALIBASE artifact and so a bad example of fishing for significance.)

So to answer the question: after selling my business I had some financial independence, and I’ve supported myself and my research from my savings for the past decade. You can think of me as unemployed, independent and/or a gentleman scholar of modest means (like, say, my fellow-countryman Charles Darwin). It’s hard to know how people have perceived my unconventional status. Everyone feels misunderstood and dismissed by peers sometimes (reviewers, editors, conference organizers…), so it’s impossible to say whether I would have been better accepted if I had a conventional affiliation. MUSCLE has been helpful because many people have heard of it (almost 3,000 citations so far per Google Scholar). That gave me some street cred early on, and surely helped open other doors.

I don’t see any lessons here to help young scientists, so in an attempt to avoid becoming a bad influence I will defer to a more accomplished scientist who has much better advice. I highly recommend this lecture from Richard Hamming. (If the link breaks, google “you and your research” “richard hamming”).

Multiple protein alignment is a dead field

I believe that traditional multiple protein alignment algorithm development has reached a point of diminishing returns. I regard it as an essentially solved problem for practical purposes, and the marginal progress that could be made is impossible to measure. It is — or should be — a “stagnant field”, to quote a recent critic of one of my papers.

Some history first. CLUSTALW is one of the most highly cited methods in science. The publication of BALIBASE in 1999 triggered a benchmark war, stimulated no doubt by the importance of multiple alignment to a wide range of problems in biology, plus the career advantages if your method got a lot of citations. (I am not excepting myself here — MUSCLE is widely known and has been cited about 3,000 times, which has opened up many opportunities for me).

BALIBASE was a reasonable first attempt at a benchmark, and while I believe versions 1 and 2 are not as good as many people believed, they are defensible. However, there are several critical questions that are rarely asked or satisfactorily answered. What is the definition of a correct alignment? How exactly were the benchmark alignments made? How do the authors know they are correct? Can we verify that they are correct?

There are different possible definitions of a correct alignment, and they do not always agree. For example, we can require that homologous residues appear in the same column. In general, this is impossible in a multiple alignment because it is an over-simplification of evolutionary history (see Big alignments–do they make sense?). An alternative is to require that structurally equivalent residues are aligned, but structural equivalence gets fuzzy as structures diverge; there is no unique definition of which individual residues are structurally equivalent. This is shown by the fact that different structural alignment algorithms and different human experts will sometimes disagree about residue correspondences. A better way to think about alignments of very distantly related proteins may be to think at the level of secondary structure elements or folds rather than individual residues. All current benchmarks (BALIBASE, PREFAB, OXBENCH and SABMARK) provide reference alignments and make assessments based on whether individual residues are aligned in agreement with the benchmark. This is reasonable if you take an average over a large number of residues in a large number of alignments, but not if you assume that individual residues in a given reference alignment are all correct. This is a fundamental flaw in some recent assessments that claim to measure specificity (e.g. the FSA paper recently published in PLoS Comp Bio). In the case of BALIBASE, there are many examples of structures where homology is unclear: the folds are similar, but experts and databases such as SCOP and CATH do not consider the evidence of homology strong enough to place them in the same superfamily. In these cases, it is clearly not possible to make reference alignments that are reliable at the level of individual residues. If you can’t be sure that the folds are homologous, you certainly cannot be sure which residues are homologous — maybe none of them are!

BALIBASE v3 is a different case. As I recently showed in my paper “Quality measures for protein alignment benchmarks”, version 3 was mostly aligned by sequence methods, which makes it unsuitable for assessment of multiple alignment programs. Further, it contains egregiously wrong alignments in which non-homologous domains with radically different folds are aligned to each other. The reference alignments in v3 are homologous only in isolated regions surrounded by non-homologous domains. This data grossly violates the assumptions of global alignment methods like CLUSTALW and MUSCLE, which BALIBASE is often used to assess. This makes no sense. Consider, say, mitochondrial chromosomes. These undergo rearrangements, so they are not globally alignable even in closely related species. So it wouldn’t make sense to attempt to align them with CLUSTALW. Similarly, BALIBASE v3 is not globally alignable, so it makes no sense to align it by a global method. The position of the BALIBASE authors appears to be: (1) fold-level alignments are hard, therefore they should be in a benchmark, and (2) aligning locally homologous regions in proteins that are not globally alignable is hard, so they should be in a benchmark. That is a reasonable argument, but only if your benchmark provides adequate annotation of the data and defines accuracy measures that are informative for this type of data. If you’re doing structure prediction by homology modeling, fold-level alignments make sense, and they are assessed for example in the CASP competition. But BALIBASE claims to have correct residue-level alignments and defines correctness as exact agreement with its residue correspondences, which is nonsense when the folds have uncertain homology. I don’t believe it is informative to test whether MUSCLE can correctly align a locally homologous region in a set of proteins that are not globally alignable. To me, that is a misuse of the method. Reasonable people might disagree, but regardless it is surely important to know that you’re testing a method on data it wasn’t designed to handle.

We should also ask how well benchmark alignments model real biological problems. They generally contain a small number of highly diverged sequences, which is not representative of most practical datasets.

Differences in accuracy between the better methods as measured by benchmarks are typically rather small — maybe a few percent at most. These differences are probably much smaller than the uncertainties caused by questionable benchmark alignments and the failure of benchmarks to predict performance on your particular problem, which is probably quite different in several important respects — maybe you have thousands of closely related sequences.

I believe the benchmark wars have become increasingly isolated from practical biological problems over the past few years. Does an improvement of a couple of percent on BALIBASE imply a meaningful improvement in your application? This might be worth investigating, but my guess is no.

I believe the multiple alignment problem is, for all practical purposes, solved. If you’re aligning proteins in the twilight zone, the correct alignment is undefined by structure and unknowable by homology. If identities are higher, then all current methods do pretty well and you might as well use something fast and convenient. I see no reason to invest more effort in trying to improve the benchmark scores of MUSCLE, and I am therefore moving on to other things.

Given the explosion in sequence data, you might argue that creating big alignments is an important problem. But I think this is the wrong approach because you cannot avoid increasing numbers of errors (see Big alignments–do they make sense?), so other approaches are needed. I believe that fast classification methods (search and clustering) are more important, so that’s what I’m working on now with USEARCH and UCLUST.