Fishing for significance

Recently I read a blog post that pointed me at a very interesting paper in Bioinformatics called “Over-optimism in bioinformatics research”. Anne-Laure Boulesteix describes a pervasive problem called “fishing for significance”. I quickly realized that (1) she’s right, and (2) mea culpa — I have been fishing for years.

Here’s what happens. You’re developing a new algorithm, say a multiple sequence alignment algorithm, or trying out ideas for improvements (say, a new version of MUSCLE). Would neighbor-joining or UPGMA work better for the guide tree? Would 3-mers or 4-mers work better for fast tree building? Does sequence weighting help or hurt [I think it actually hurts, but that’s a story for another post]? To answer these questions, you write some code to implement each idea and run the program on a standard benchmark, which would be (say) BALIBASE for protein alignments. You keep the ideas that “work” (meaning, they get a better score on BALIBASE) and throw the rest away.

This was exactly the strategy I used in developing MUSCLE, and I believe it is generally how algorithms are developed in this field. If there is no standard benchmark, you develop your own, and then the same applies. Sometimes people describe this process (I did in my second MUSCLE paper, and the Opal paper also makes it clear), but more often only the final design is presented.

This strategy results in a hidden form of over-tuning to the benchmark. If you’ve done this, then it’s too late to do fancy statistical footwork like splitting the data into testing and training sets for parameter tuning; you’ve already over-tuned a bunch of boolean yes/no parameters that select elements of the algorithm. Ideally, this would be solved by blind testing as done in the CASP competitions. But in most areas of computational biology you can’t do this, and it’s hard to see a better way to approach algorithm development. I think the main lesson here is that we should be more skeptical of published validation and benchmark results: very often improvements in the state of the art and claims of statistical significance are greatly exaggerated. This is certainly true in multiple alignment, where the latest BALIBASE is disastrously bad and other benchmarks are poor models of alignment problems encountered in practice.
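One way to see the size of this effect is a toy simulation (a hypothetical sketch, nothing from any real codebase): every candidate tweak below has exactly the same true accuracy, yet picking the best benchmark score still yields an apparent “improvement”, because we are selecting the maximum of many noisy measurements.

```python
import random

random.seed(1)

TRUE_SCORE = 0.80   # every variant is truly identical
NOISE = 0.02        # benchmark measurement noise
N_IDEAS = 30        # candidate algorithm tweaks tried during development

def benchmark(_variant):
    """One noisy benchmark evaluation (a stand-in for e.g. BALIBASE)."""
    return TRUE_SCORE + random.gauss(0.0, NOISE)

# "Fishing": try each idea, keep whichever scores best on the benchmark.
scores = [benchmark(i) for i in range(N_IDEAS)]
best_idea = max(range(N_IDEAS), key=lambda i: scores[i])

print(f"benchmark score of chosen variant: {scores[best_idea]:.3f}")
print(f"true score of chosen variant:      {TRUE_SCORE:.3f}")
# The winner's benchmark score overstates its real performance: we picked
# the maximum of N noisy measurements of the same underlying quantity.
```

The gap between the chosen variant’s benchmark score and its true score is exactly the hidden over-tuning described above, and it grows with the number of ideas you try.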


2 responses to “Fishing for significance”

  1. I agree that this is the way comp. biol. methods get developed. However, a simple solution (if your dataset or benchmark is non-redundant) is to keep, say, 20% of the data untouched and work on the rest for the entire development period. Then, only once you have achieved your goal performance on the 80% (over-tuning the hell out of it), test on the virgin 20% and you’re good.
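The holdout procedure the commenter describes amounts to something like this sketch (illustrative; `cases` stands in for benchmark items such as reference alignments):

```python
import random

def holdout_split(cases, frac_test=0.2, seed=0):
    """Set aside a 'virgin' test fraction before development starts.

    `cases` is a list of benchmark items (e.g. reference alignments).
    Develop and over-tune freely on the returned train set; score on
    the test set exactly once, at the very end.
    """
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * frac_test)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train, test = holdout_split(range(100))
print(len(train), len(test))   # 80 20
```

Note the split only protects you if the held-out 20% is drawn from the same distribution as the problems your method will face in practice, which is the objection raised in the reply below the original comment.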

    • That’s what I used to believe, but now I think doing a standard test / training set split usually results in overtuning. The reason is that a typical benchmark set has systematic biases. For example, the SABMARK multiple alignment benchmark is built from solved structures. The sequences are single domains, so tend to be shorter than a full-length protein, and the input sets have small numbers of sequences because of the way they were selected. So tuning to a training set randomly selected from SABMARK may bias your method to small numbers of short, highly diverged sequences. The test / training set split is only robust if your benchmark is a good model of the distribution of problems that the algorithm might encounter in practice. That is rarely, if ever, the case in computational biology. CASP comes pretty close, but even there you could argue there is a bias towards structures that are solvable and medically important.
