Counting number of nucleotide substitutions - synonymous vs nonsynonymous (Apr/17/2006 )
I want to compare the number of synonymous and non-synonymous nucleotide substitutions between some sequences with lengths varying from 10 to 100. Through my web searches I have found numerous methods for this task, and I really do not know which one to use that is both widely accepted and not too old. I also need a text that describes the method step by step, so that I can be sure I am doing it right. I would be delighted if someone could help me with this problem.
I am not sure what you mean! Observing synonymous and non-synonymous mutations is simple and there is only one way of doing so. You require your protein sequence and the corresponding CDS for each of your targets. I did this years ago for my master's project, based on work by Yasuo Ina. You then need a look-up table of all the triplets and their associated amino acids. Step along the DNA codon by codon in each ORF and compare the coding: if the looked-up amino acid differs then you have a non-synonymous mutation, otherwise it's synonymous. I suggest you use Perl and write your own method - it's a very nice program to introduce you to the language.
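The look-up approach described above can be sketched as follows. This is only a minimal illustration, written in Python rather than Perl, and it makes several assumptions that are not from the post: the two CDSs are already aligned in frame, the standard genetic code applies, and a codon pair that differs at more than one position is counted once. The function name classify_codon_differences is made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_differences(cds_a, cds_b):
    """Count differing codon pairs as synonymous or non-synonymous.

    Assumes both sequences are aligned, in frame, and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, multiple of 3")
    synonymous = 0
    nonsynonymous = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue  # identical codons carry no difference
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: not classified in this sketch
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


if __name__ == "__main__":
    # TTT/TTC both encode Phe; AAA (Lys) vs AGA (Arg) changes the amino acid.
    print(classify_codon_differences("TTTAAA", "TTCAGA"))
```

Note that this only counts observed differences between two sequences; it does not estimate a rate.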
I agree with DPK -- I was going to suggest Perl, but it was already mentioned.
Check out MEGA3 from Nei and Kumar here
I am guessing that what you were asking for was a method for inferring substitution rates. The original method for that inference is the Nei and Gojobori evolutionary pathway method, which uses the universal codon table and assumes that there is no codon bias. If you want to get fancier, you will have to read the Nielsen and Yang papers. Their associated software is called PAML.
There are several other methods for that, almost all of which are implemented in MEGA. It has a tutorial, and there is also a book by the authors explaining each method and what the differences between them are.
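To give a feel for the first step of a Nei and Gojobori style calculation, here is a rough Python sketch that counts potential synonymous and non-synonymous sites in a coding sequence. It is not the MEGA or PAML implementation. It assumes the standard genetic code, treats a change to a stop codon as non-synonymous, and skips any codon that is a stop or contains an ambiguous base; other implementations handle these cases differently. The function names are invented for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Number of synonymous sites (between 0 and 3) in one sense codon.

    For each position, add the fraction of the three possible single-base
    changes that leave the amino acid unchanged.
    Assumption: changes to a stop codon count as non-synonymous.
    """
    amino_acid = CODON_TABLE[codon]
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino_acid:
                total += 1.0 / 3.0
    return total


def count_sites(cds):
    """Return (S, N): synonymous and non-synonymous sites in an in-frame CDS."""
    cds = cds.upper()
    syn = 0.0
    n_codons = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if CODON_TABLE.get(codon, "*") == "*":
            continue  # skip stop codons and codons with ambiguous bases
        syn += synonymous_sites(codon)
        n_codons += 1
    return syn, 3 * n_codons - syn


if __name__ == "__main__":
    print(count_sites("TTAATG"))
```

When comparing two sequences, S and N are usually averaged over the pair before the observed differences are divided by them.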
Those are two different things -- getting a count of substitutions between two sequences is easy; estimating the rate at which such substitutions occur is another thing entirely, and is harder to calculate. Which are you trying to do?
Thanks for all your help.
Actually, I myself was kind of puzzled about whether I need the count of synonymous vs nonsynonymous differences or their substitution rates (as you all mentioned). However, isn't it true that to infer evolutionary data from the syn/nonsyn differences between two sequences, one needs to know the substitution rate? Something like this: the rate predicts the difference to be ... but by counting I have found ... . Is there a generally accepted rate used for this task, or am I totally out of the picture?
And by the way, all I know of programming is some C++, which has helped until now, but it is a little hard to master and debugging the programs takes quite a lot of time. If Perl is really good, maybe I should switch and start learning that language.
Nice post L_Han - I haven't read about that software.
I don't follow what you mean by infer evolutionary data, sorry. It appears to me that you are going to be doing loads of reading very soon, for where you are about to tread is dangerous ground!
Evolutionary biology is tricky - there are camps that like methods A and B whilst others prefer C and D. Just have a look around for maximum likelihood and maximum parsimony fans, introduce them and watch them fight! As L_Han suggested, look at the work by Nei, Gojobori and Nielsen and Yang (http://abacus.gene.ucl.ac.uk/)
If you want general rate matrices then you have to look at PAM & BLOSUM, pick the one you want and run with it. You need to be able to defend your choice, so perhaps it is best to do some sort of run with all of them and complete a statistical test to validate your choice. I am not a fan of this stuff - the group next door do this stuff, euggh.
As for language - if you can do a little bit of C++ you will find Perl a joy. The syntax is far less complicated and the language is so so so much more forgiving. If you do have a look at it, make sure to put the following bits at the top of each script you write:
It will make your life so much easier. But remember not to forget your C++!
I have already read the articles you suggested. But maybe I need to read them again and find out exactly which method I should use. That MEGA software is really interesting! Thanks for the comments and precautions.
You need to have a hypothesis to test going into the rate stuff, so be careful.
According to the neutral theory, if a locus is neutral you expect the synonymous substitution rate to be equal to the nonsynonymous substitution rate, dN/dS=1. If you have a dN/dS ratio <1 it may indicate negative selection, meaning that non-synonymous substitutions at the locus are not favored. If dN/dS >1 it indicates positive selection, meaning that non-synonymous substitutions are favored...
However, there is very large evolutionary variance around these estimates, and you cannot just estimate them by counting. That is why there is such a heated debate over how to estimate them...
It also depends on whether you have diversity data (within species) or divergence data (between species), which limits the number and type of methods available to you...
As DPK suggests, I recommend some reading and, if possible, go bother someone accessible for more information on these things. It is dangerous ground...
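As a rough illustration of the dN/dS ratio described above, the Python sketch below turns counts into distances by dividing observed differences by the number of sites and applying a Jukes-Cantor correction for multiple hits, as in the Nei and Gojobori approach. The input numbers in the usage lines are invented purely for illustration, the function names are hypothetical, and the result is a point estimate with no measure of its variance, so it should not be read as a test for selection.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor correction: d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(syn_diffs, nonsyn_diffs, syn_sites, nonsyn_sites):
    """Return (dN, dS, dN/dS) from difference counts and site counts.

    The ratio is NaN when dS is zero, since it is then undefined.
    """
    d_s = jukes_cantor(syn_diffs / syn_sites)
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ratio = d_n / d_s if d_s > 0 else float("nan")
    return d_n, d_s, ratio


if __name__ == "__main__":
    # Made-up counts: 10 synonymous differences over 50 synonymous sites,
    # 5 non-synonymous differences over 150 non-synonymous sites.
    d_n, d_s, omega = dn_ds(10, 5, 50, 150)
    print(f"dN={d_n:.4f} dS={d_s:.4f} dN/dS={omega:.3f}")
```

For short sequences (a few dozen codons) the site counts are small, so the ratio can swing widely with a single extra difference.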
I do have a hypothesis. But I want to check the rates and substitutions between parts of sequences, not the whole sequences (for instance, part of a domain compared with the similar part in other species). It has been on my mind that simple counting cannot be the way, and that the matter should be more complicated and tricky, as everyone said. I am still trying to study the methods more thoroughly to see which one better accords with the hypothesis and the data I have available. And I would appreciate it if you could explain more about diversity data and divergence data, or refer me to somewhere I can read about it.