Recently I added a local search option (UBLAST) to USEARCH. This means that I’m having to reinvent stuff that was cutting-edge research in the late 1980s and early 1990s. Comments and suggestions are solicited from experts in this area.
The basic idea in USEARCH is to use an index to quickly find a few candidate top hits from the database, which are then examined more carefully. Candidates are ranked by word count, which correlates well enough with sequence identity that it can pick out (say) a few hundred candidates from a few million proteins and have pretty decent sensitivity into the twilight zone. With an appropriate definition of sensitivity, that is — I’m not trying to find all homologs, just the closest.
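The word-count ranking idea can be sketched as follows. This is a toy Python sketch, not the USEARCH implementation: the function names, the dict-based index, and the parameters (k=3 words, top 2 candidates) are all illustrative.

```python
from collections import defaultdict

def kmers(seq, k=3):
    """All overlapping words (k-mers) in a sequence."""
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def build_index(db, k=3):
    """Map each word to the list of database sequence ids containing it."""
    index = defaultdict(list)
    for sid, seq in db.items():
        for w in set(kmers(seq, k)):
            index[w].append(sid)
    return index

def rank_candidates(query, index, k=3, top=2):
    """Rank database sequences by the number of unique words shared
    with the query, and keep only the top few as candidates."""
    counts = defaultdict(int)
    for w in set(kmers(query, k)):
        for sid in index.get(w, ()):
            counts[sid] += 1
    return sorted(counts, key=lambda sid: -counts[sid])[:top]
```

Only the handful of top-ranked candidates go on to the more expensive alignment stage; everything else in the database is never touched.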
I need to implement my own BLAST-like algorithm because the selected subset of the database changes with each query, so existing implementations can’t be used. The current UBLAST is a quick-and-dirty gapless seed-and-extend algorithm using exact 3-mer seeds for amino acids and 5-mers for nucleotides. I used BLOSUM62 with vanilla Karlin-Altschul E-values (no edge-effect corrections etc.). This was envisaged as a baseline to work from, but I did a PDB40 test and to my surprise it had better sensitivity than ungapped NCBI BLAST, so I shipped the prototype.
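The gapless extension step looks something like this. A minimal sketch only: the toy +2/-1 match/mismatch function stands in for BLOSUM62, and the x-drop threshold `X` is an illustrative value, not a tuned one.

```python
def ungapped_extend(q, t, qi, ti, k, sub, X):
    """Extend an exact k-mer seed at query position qi / target position ti
    in both directions without gaps, stopping a direction when the running
    score drops more than X below the best score seen (x-drop)."""
    best = cur = sum(sub(q[qi+j], t[ti+j]) for j in range(k))  # seed score
    # Extend to the right of the seed.
    i, j = qi + k, ti + k
    while i < len(q) and j < len(t):
        cur += sub(q[i], t[j])
        best = max(best, cur)
        if best - cur > X:
            break
        i += 1; j += 1
    # Extend to the left of the seed; gains add to the best-so-far total.
    cur = best
    i, j = qi - 1, ti - 1
    while i >= 0 and j >= 0:
        cur += sub(q[i], t[j])
        best = max(best, cur)
        if best - cur > X:
            break
        i -= 1; j -= 1
    return best
```

The returned score is the best ungapped HSP through the seed, which is what feeds into the Karlin-Altschul E-value.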
To improve on this, I need to design and tune a number of implementation details. These have been well-studied, but I have some questions that the literature doesn’t seem to answer.
I’ve found that floating-point arithmetic is comparable in speed to integer on modern processors, so these days I write all my d.p. code in floating point, which is more flexible and avoids issues of over- and underflow. I noticed that using a full-precision VTML220 matrix in UBLAST gave about a 4% sensitivity improvement at 1% EPQ over the usual integer-rounded BLOSUM62, which could easily be spurious but seemed worth pursuing. A significant improvement is plausible considering that the E-value has an exponential dependence on the score, so small corrections to the score can have a large effect on the E-value. This also applies to gap penalties: the extension penalty is often 1, when perhaps 0.3 or 1.3 might be better and might give a further boost to sensitivity. The Brenner et al. result [doi:10.1093/bioinformatics/bti627] that modern matrices are no better than BLOSUM could be an artifact of integer rounding.
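To see how strong the exponential dependence is, here is the Karlin-Altschul formula E = K·m·n·e^(−λS) worked through numerically. The λ, K, query length and database size below are illustrative round numbers, not parameters for any particular matrix; the point is only that each unit of score changes E by a constant factor e^λ, so rounding errors of a fraction of a bit accumulate multiplicatively.

```python
import math

def evalue(S, m, n, lam, K):
    """Karlin-Altschul E-value for an ungapped local alignment score S,
    query length m and database size n residues."""
    return K * m * n * math.exp(-lam * S)

# Illustrative (not tuned) parameters:
lam, K = 0.32, 0.13
m, n = 300, 50_000_000
for S in (50.0, 51.0, 52.0):
    print(S, evalue(S, m, n, lam, K))
```

Each extra point of score shrinks E by the same factor e^λ, so a matrix entry that is off by a few tenths from integer rounding shifts the E-value of a long alignment by a substantial factor.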
This raises the problem of untangling the contributions of the many parameters in the algorithm: matrix, gap penalties, x-drop, T (the threshold score for a seed), and the K-A parameters and corrections such as edge effects. Here is my current plan of attack. I will write an ungapped Smith-Waterman engine with floating-point d.p. and use it to score all pairs in PDB40. This isolates the matrix, since there are no other variables. I can then try different methods for computing E-values from these scores, e.g. to measure the effectiveness of edge corrections or composition-based statistics. If this gives consistent results for different matrices, that tells me the most effective method for E-value calculation and whether I can neglect any of the corrections. If the results are not consistent, that tells me E-value calculation is a black art and I need to fall back on trial and error with each variant of the algorithm. (I can’t do a curve fit like SSEARCH because the subset of scored database sequences is small and biased when the algorithm is embedded in USEARCH.) Then I can ask whether the sensitivity improvement for floating point still holds. If it does, that motivates the effort to check whether it holds up in the gapped case, where it might give a worthwhile improvement over BLASTP.
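The ungapped engine is pleasantly simple: without gaps the Smith-Waterman recurrence collapses to H[i][j] = max(0, H[i−1][j−1] + s(qᵢ, tⱼ)), i.e. each cell depends only on its diagonal neighbor. A minimal floating-point sketch (the substitution function is a placeholder for a full-precision matrix like VTML220):

```python
def ungapped_sw(q, t, sub):
    """Best ungapped local alignment score between q and t, computed in
    floating point.  Recurrence: H[i][j] = max(0, H[i-1][j-1] + s(q[i], t[j]));
    the answer is the maximum over all cells."""
    best = 0.0
    prev = [0.0] * (len(t) + 1)  # previous row of H
    for a in q:
        cur = [0.0] * (len(t) + 1)
        for j, b in enumerate(t, 1):
            cur[j] = max(0.0, prev[j - 1] + sub(a, b))
            if cur[j] > best:
                best = cur[j]
        prev = cur
    return best
```

Because each cell only reads the diagonal, one previous row suffices; a production version would vectorize this, but for scoring all of PDB40 once even the naive loop is adequate.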