UCLUST v2.1 now available

Today I posted uclust v2.1. For information on new features and fixes, see this page. It includes a new clustering method I call clumping, which is my name for clustering with the goal of identifying clusters (clumps) of pre-determined size. Members of a given clump should be more similar to each other than to members of other clumps. The motivation for clumping is to divide a set of sequences into pieces that are small enough for a given method to handle — say, multiple alignment or phylogenetic tree estimation.
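To make the idea of size-limited clusters concrete, here is a minimal sketch in Python. It is not the algorithm UCLUST uses; it is only a greedy, size-capped clustering based on shared k-mers. The function names (`kmers`, `jaccard`, `clump`) and the parameters (`max_size`, `min_sim`, `k`) are made up for illustration.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clump(seqs: Dict[str, str], max_size: int,
          min_sim: float = 0.0, k: int = 3) -> List[List[str]]:
    """Greedy, size-capped clustering (an illustrative assumption, not UCLUST).

    Each sequence joins the most similar clump that still has room,
    judged against that clump's first member; otherwise it starts a new clump.
    """
    clumps: List[List[str]] = []   # sequence labels in each clump
    seeds: List[Set[str]] = []     # k-mer set of each clump's first member
    for label, seq in seqs.items():
        words = kmers(seq, k)
        best, best_sim = -1, min_sim
        for i, seed in enumerate(seeds):
            if len(clumps[i]) >= max_size:
                continue  # this clump is already full
            sim = jaccard(words, seed)
            if sim > best_sim:
                best, best_sim = i, sim
        if best < 0:
            clumps.append([label])
            seeds.append(words)
        else:
            clumps[best].append(label)
    return clumps


if __name__ == "__main__":
    example = {
        "a1": "ACGTACGTAC",
        "a2": "ACGTACGTAA",
        "b1": "TTTTGGGGCC",
        "b2": "TTTTGGGGCA",
        "a3": "ACGTACGTAG",
    }
    # With max_size=2, a3 cannot join the full a1/a2 clump and starts its own.
    print(clump(example, max_size=2))
    # prints [['a1', 'a2'], ['b1', 'b2'], ['a3']]
```

Because the assignment is greedy and order-dependent, a sequence can be pushed out of its best-matching clump once that clump is full, as `a3` is here. A real method has to handle that trade-off more carefully.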

Does anyone out there know of a term for this type of clustering? Please let me know!

Clumping is the basis of my new method for building very large alignments with MUSCLE. You can try this yourself now if you have both MUSCLE and UCLUST; the procedure is described in the new UCLUST manual.

There is also a chimera detection algorithm (UCHIME) that can find chimeras de novo in a set of unclassified and unaligned reads. This is a work in progress, but it is working pretty well, so I thought I’d make it available.

3 responses to “UCLUST v2.1 now available”

  1. Have you ever worked with the problem of aligning multi-domain proteins? Especially when the domain(s) in question are what are called promiscuous or versatile, i.e. these domain(s) may be found in several different combinations with other protein domains, and often in scrambled order, so that the highly variable linear order of domains precludes a good MSA using a global-alignment strategy…

    • See Multiple alignment of protein sequences with repeats and rearrangements. Phuong TM, Do CB, Edgar RC, Batzoglou S. Nucleic Acids Res. 2006;34(20):5932-42. Epub 2006 Oct 26. Quite honestly, this doesn’t work very well with distantly related domains, which is the hard & interesting case.

  2. Yeah, I’ve been trying to use ProDA, but my sequence is short and quite divergent (Pfam HMM – PF00646.26 from release 24 I think). As you indicated, when domain sequences are sufficiently divergent, it is hard to get a good alignment with ProDA.
    So one workaround I can imagine is to write a script that takes the regions assigned to each Pfam domain out from all taxa in the MSA, aligns each domain individually using hmmalign, and then splices them back together along with the rest of the linker aa sequences aligned with regular MSA software like MAFFT.
    But this will work ONLY for proteins that share the EXACT same domain architecture. What do I do when, as in my case, domain sequences are becoming divergent AND there is a LOT of scrambling of domain architectures that precludes easy manual or Perl-script-assisted alignment? And I don’t think I understand the math behind A-Bruijn or ProDA! 😦
