Output of MUSCLE alignment shortens names, how can I get the long names back again - (Dec/18/2009)

Hello,
I'm aligning a bunch of sequences with MUSCLE by inputting a FASTA file with regular header info such as this:
>gi|280987219|gb|GQ200200.2| Cohaesibacter sp. DQHS-21 16S ribosomal RNA gene, partial sequence

However, the output file from MUSCLE shortens the names to something like this: gi|2809872. This seems to be common in phylogenetics software too (I think the PHYLIP format has a short name as well). I presume this is so that reference IDs are passed around the program instead of the full name. But what is the easiest way to get the names back again? For example, I'm feeding the MUSCLE alignment file into RAxML, which creates a tree file (I'm not sure what format this is in), and then I want to look at the names on the branches, not just an ID.

Any help much appreciated,
Thanks,
Phil

-PhilS-

Hi,

Unfortunately, you can't do much about this directly with MUSCLE or many other phylogenetic programs; however, you can make it easier to deal with.

Enter REFGEN and TREENAMER, two programs I have written in my spare time (read the paper here). They take the standard headers from NCBI GenBank and the DOE JGI Genome Project, which carry a lot of information that is unhelpful in your resulting trees (and which most phylogeny programs cannot handle, as you have noticed), and create an ID from the accession and species name.
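The general idea of swapping long headers for short IDs can be sketched in a few lines of Python. This is not REFGEN itself, only a minimal illustration under assumptions: the function name `shorten_headers`, the `s000001`-style IDs and the in-memory mapping are all made up for the example. The IDs are kept well under 10 characters, the name length that strict PHYLIP format allows.

```python
def shorten_headers(fasta_text, prefix="s"):
    """Rename FASTA records to short unique IDs.

    Returns (renamed_fasta_text, mapping), where mapping is
    {short_id: original_header_without_the_leading_'>'}.
    The ID scheme (prefix + 6-digit counter) is an assumption for
    illustration, not what REFGEN produces.
    """
    mapping = {}
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Sequential IDs are unique by construction: s000001, s000002, ...
            short_id = f"{prefix}{len(mapping) + 1:06d}"
            mapping[short_id] = line[1:].strip()
            out_lines.append(">" + short_id)
        else:
            # Sequence lines are passed through unchanged
            out_lines.append(line)
    return "\n".join(out_lines) + "\n", mapping


if __name__ == "__main__":
    example = (
        ">gi|280987219|gb|GQ200200.2| Cohaesibacter sp. DQHS-21 "
        "16S ribosomal RNA gene, partial sequence\n"
        "ACGT\n"
    )
    renamed, id_map = shorten_headers(example)
    print(renamed, end="")
    for short_id, header in id_map.items():
        print(short_id, header, sep="\t")
```

Saving the mapping (for example as a tab-separated file) alongside the alignment makes it possible to restore the full names later.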

This obviously shortens the header again, but once you have a tree from your analysis, you can use the second tool, TREENAMER, to replace the ID code with the species name and/or accession, which are the important parts of the header...
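RAxML writes its trees in Newick format, which is plain text, so the ID-to-name replacement can also be sketched by hand. The following is not TREENAMER, just a minimal Python illustration; `restore_names` and the `mapping` dictionary (short ID to the label you want shown) are hypothetical names. Labels are wrapped in single quotes because spaces, commas, colons and parentheses otherwise have a meaning in Newick; whether quoted labels display well depends on the tree viewer.

```python
import re


def restore_names(newick, mapping):
    """Replace short IDs in a Newick string with quoted full labels.

    mapping is {short_id: full_label}. Both names are hypothetical,
    chosen for this example only.
    """
    if not mapping:
        return newick

    def quoted(label):
        # Newick convention: wrap in single quotes, double any inner quote
        return "'" + label.replace("'", "''") + "'"

    # Try longer IDs first, and only match whole tokens so that an ID
    # is never replaced inside a longer ID or inside a number.
    ids = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w|.])(" + "|".join(re.escape(i) for i in ids) + r")(?![\w|.])"
    )
    return pattern.sub(lambda m: quoted(mapping[m.group(1)]), newick)


if __name__ == "__main__":
    id_map = {"gi|2809872": "Cohaesibacter sp. DQHS-21 GQ200200.2"}
    tree = "(gi|2809872:0.1,other:0.2);"
    print(restore_names(tree, id_map))
```

IDs that are not in the mapping (like `other` above) are left untouched, so the function can be run on a tree even when only some names are known.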

Hope that helps...

-guyleonard-

GREAT, thanks so much
Phil

(In reply to guyleonard's post of May 4 2010, 02:13 AM, above.)

-PhilS-