Help with ClustalW - (Aug/04/2007 )

I'm trying to show that two sequences are not similar using ClustalW, and I do not understand what the scores mean or what counts as a significant score. When I try blasting one protein or the other, neither comes up as a match. I would appreciate it if someone could explain the ClustalW scores.

Thanks

-RandomGuy187-

Hi!

I have never used ClustalW for that, so I can't help directly, but did you try looking for help on the software website? The answer may be there.
http://www.ebi.ac.uk/Tools/clustalw/

There is a FAQ and a help section.

I hope you find your answer here!

Caro

-atlantide-

Usually, you would try to prove that two sequences are similar, but you want to prove that they are dissimilar.
One such line of proof would be the simple fact that ClustalW is unable to align them. It may be more straightforward to simply BLAST the two sequences against each other and look at the bit score and E-value. A low bit score (relative to the sequence length) means low similarity. A high E-value, e.g. higher than 0.1, means that a hit scoring at least this well is likely to turn up purely by chance.
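
To make the link between bit score and E-value concrete, here is a rough sketch using the standard relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the two sequence lengths. The function name is made up for illustration, and BLAST itself uses length-corrected ("effective") values for m and n, so treat the result as an approximation of what the program reports.

```python
def expected_chance_hits(bit_score, query_len, subject_len):
    """Approximate E-value from a bit score: E = m * n * 2^(-S').

    Assumption: raw lengths are used; BLAST applies edge-effect
    corrections, so its reported E-values will differ somewhat.
    """
    return query_len * subject_len * 2.0 ** (-bit_score)


# Two 300-residue proteins: compare a weak and a strong bit score.
for score in (15, 50):
    print(score, expected_chance_hits(score, 300, 300))
```

An E-value near or above 1 says a score that good is expected by chance alone; a value many orders of magnitude below 1 says it is not.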

BLAST also reports similarity as the percentage of amino acids (or nucleotides) that are identical or similar. For two random DNA sequences with equal base frequencies, you would expect about 25% identity.
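
The 25% figure is simply the chance that two independently drawn bases match, i.e. the sum of the squared base frequencies. A small sketch (the function name and the example frequencies are only for illustration):

```python
def expected_random_identity(base_freqs):
    """Chance that two independently drawn letters are the same.

    Assumes both sequences share the same letter frequencies and are
    compared position by position, without gaps.
    """
    return sum(p * p for p in base_freqs.values())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}

print(expected_random_identity(uniform))  # equal base frequencies
print(expected_random_identity(at_rich))  # skewed composition gives more chance matches
```

Note that this is the baseline for an ungapped comparison. An alignment program that is allowed to insert gaps is optimizing the match, so unrelated sequences will generally come out somewhat above this baseline.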

-SDH-

So ClustalW isn't the greatest tool in the world; indeed, it really dies when two dissimilar sequences are used as input. I would use MUSCLE in the future. If you have two sequences and you want a measure of distance, then the method posted above seems reasonable. You could also try a tool like needle (follow the EMBOSS link above) and do a pairwise alignment. This will essentially return junk, which is what you want.

-perlmunky-

A principled way to measure this dissimilarity is to randomize (shuffle) one of the two sequences and compare the quality (or lack of quality) of the alignments. Unless the two real sequences align substantially better than the randomized ones do, you can be fairly sure they are not related. Of course, you can repeat this many times with different randomized samples to collect good statistics on the random alignments.
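
A minimal sketch of such a shuffle test is below. To stay self-contained it uses a toy global aligner (Needleman-Wunsch with match = +1, mismatch = -1, gap = -2); those scores, the function names, and the example sequences are assumptions for illustration only. For real proteins you would want a proper substitution matrix and gap penalties, e.g. by running needle on each shuffled sequence instead.

```python
import random


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Compare the real alignment score with scores against shuffled b.

    Returns (real_score, z_score, empirical_p). Shuffling keeps the
    length and composition of b but destroys its order.
    """
    rng = random.Random(seed)
    real = nw_score(a, b)
    letters = list(b)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null.append(nw_score(a, "".join(letters)))
    mean = sum(null) / n_shuffles
    sd = (sum((s - mean) ** 2 for s in null) / (n_shuffles - 1)) ** 0.5
    z = (real - mean) / sd if sd > 0 else float("nan")
    # Fraction of shuffles scoring at least as well as the real pair
    # (+1 in numerator and denominator so p is never exactly zero).
    p = (1 + sum(s >= real for s in null)) / (n_shuffles + 1)
    return real, z, p


seq1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"  # made-up example
seq2 = "GSHMLEDPVKRWTNCAQYFGIDLKPSTEAVHRMNWQYCDG"  # made-up example
print(shuffle_test(seq1, seq2))
```

If the z-score is close to zero and the empirical p is large, the real pair aligns no better than shuffled sequences do, which is the evidence of dissimilarity you are after.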

-phage434-