Recently I read a blog post that pointed me at a very interesting paper in Bioinformatics called “Over-optimism in bioinformatics research”. The author, Anne-Laure Boulesteix, describes a pervasive problem called “fishing for significance”. I quickly realized that (1) she’s right, and (2) mea culpa: I have been fishing for years.
Here’s what happens. You’re developing a new algorithm, say a multiple sequence alignment algorithm, or you’re trying out ideas for improving an existing one (say, a new version of MUSCLE). Would neighbor-joining or UPGMA work better for the guide tree? Would 3-mers or 4-mers work better for fast tree building? Does sequence weighting help or hurt [I think it actually hurts, but that’s a story for another post]? To answer these questions, you write some code to implement each idea and run the program on a standard benchmark, which would be (say) BALIBASE for protein alignments. You keep the ideas that “work” (meaning, get a better score on BALIBASE) and throw the rest away.
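To make that workflow concrete, here is a minimal sketch in Python of the selection loop I’m describing. Everything in it is a stand-in: the option names are just the examples above, and score_on_benchmark is a hypothetical callback for “run the program and score it on BALIBASE”, not anything from the real MUSCLE code.

```python
# Hypothetical sketch of the "keep whatever scores better" development loop.
# The option names and score_on_benchmark are illustrative, not real MUSCLE code.

from itertools import product

OPTIONS = {
    "guide_tree":    ["neighbor_joining", "upgma"],
    "kmer_length":   [3, 4],
    "seq_weighting": [True, False],
}

def develop_by_fishing(score_on_benchmark):
    """Try every combination of design choices and keep whichever scores
    best on the benchmark -- the hidden over-tuning step."""
    best_config, best_score = None, float("-inf")
    for combo in product(*OPTIONS.values()):
        config = dict(zip(OPTIONS.keys(), combo))
        score = score_on_benchmark(config)   # e.g. mean BALIBASE column score
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Each yes/no or either/or choice here acts like a parameter, even though it never gets called one, and the benchmark is what decides its value.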
This was exactly the strategy I used in developing MUSCLE, and I believe it is generally how algorithms are developed in this field. If there is no standard benchmark, you develop your own, and then the same applies. Sometimes people describe this process (I did in my second MUSCLE paper, and the Opal paper also makes it clear), but more often only the final design is presented.
This strategy results in a hidden form of over-tuning to the benchmark. If you’ve done this, then it’s too late to do fancy statistical footwork like splitting the data into testing and training sets for parameter tuning; you’ve already over-tuned a bunch of boolean yes/no parameters that select elements of the algorithm. Ideally, this would be solved by blind testing as done in the CASP competitions. But in most areas of computational biology you can’t do this, and it’s hard to see a better way to approach algorithm development. I think the main lesson here is that we should be more skeptical of published validation and benchmark results: very often improvements in the state of the art and claims of statistical significance are greatly exaggerated. This is certainly true in multiple alignment, where the latest BALIBASE is disastrously bad and other benchmarks are poor models of alignment problems encountered in practice.
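For contrast, here is a hedged sketch of what the “statistical footwork” would have to look like if it were applied from the start: hold out part of the benchmark before making any design choices, pick options on the development portion only, and report the held-out score once. The names (benchmark_cases, score_config) are hypothetical; and as noted above, doing this after the options have already been tuned on the full benchmark rescues nothing.

```python
# Hypothetical sketch: hold out part of the benchmark BEFORE choosing options.
# benchmark_cases, candidate_configs and score_config are illustrative names only.

import random

def split_benchmark(benchmark_cases, held_out_fraction=0.3, seed=0):
    """Randomly partition benchmark cases into a development set
    (used to pick design options) and a held-out set (reported once)."""
    cases = list(benchmark_cases)
    random.Random(seed).shuffle(cases)
    n_held_out = int(len(cases) * held_out_fraction)
    return cases[n_held_out:], cases[:n_held_out]   # (development, held_out)

def honest_evaluation(benchmark_cases, candidate_configs, score_config):
    """Select the best config on the development set only, then report
    its score on the untouched held-out set."""
    dev_cases, held_out_cases = split_benchmark(benchmark_cases)
    best = max(candidate_configs, key=lambda c: score_config(c, dev_cases))
    return best, score_config(best, held_out_cases)
```

Even this only limits over-tuning to the benchmark you have; it does nothing about the benchmark itself being a poor model of real alignment problems.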