
Counting number of nucleotide substitutions - synonymous vs nonsynonymous (Apr/17/2006 )


>I do have a hypothesis.
Could you share a bit of it? Just out of interest and it may also help us suggest things to you.

>But I want to check the rates and substitutions between parts of sequences together, not the whole sequence (for instance, a part of a domain with the similar one in other species).
Errm, from a protein's point of view you are unlikely (I can't think of a better way of phrasing this) to witness mutations in functional regions. Instead you are likely to see changes occurring in the external, surface-exposed amino acids, as they are less bound by evolutionary constraints (exceptions perhaps being the active site). This is the basis of comparative modelling!

>It has been in my mind that simple counting can not be the way and the matter should be more complicated and tricky, as everyone said. I am still trying to study the methods more thoroughly to see which one better accords with the hypothesis and the data I have available.
My suggestion to read the literature was stupid (to a point). If you search for the paper by Yasuo Ina (I will post the ref if I find it), you will see that his method is 'simple' but still more complicated than counting. By current standards, however, it would not be considered comprehensive (I imagine getting something published with it may be tricky). What I am trying (and failing) to get at is that the problem is horribly complex already and you don't want to reinvent the wheel (do you?). You are best off looking at the entire sequence, mapping 'hot-spots' for mutations, and then seeing if there is any correlation with known domains within the protein. You may then start to look into things like correlated mutations (if you have nice structures!).
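
For what it is worth, the "map hot-spots along the entire sequence" idea can be sketched roughly like this. This is only a minimal sketch, assuming you already have two aligned sequences of equal length; the function name `window_differences`, the window size and the toy sequences are made up for illustration and are not from any published method.

```python
def window_differences(seq_a, seq_b, window=30, step=30):
    """Count mismatched sites per window between two ALIGNED sequences.

    Returns a list of (window_start, number_of_differences) tuples.
    Positions where either sequence has a gap ("-") are not counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        diffs = sum(1 for x, y in zip(a, b) if x != y and "-" not in (x, y))
        results.append((start, diffs))
    return results


# Hypothetical toy alignment (not real data), four 6-base windows.
species_1 = "ATGGCCAAATTTGGGCCCAAATTT"
species_2 = "ATGGCCAAATTCGGACCTAAATTT"
print(window_differences(species_1, species_2, window=6, step=6))
# → [(0, 0), (6, 1), (12, 2), (18, 0)]
```

Windows with unusually high counts are the candidate hot-spots you would then compare with known domain boundaries. Raw counts like these say nothing about significance or multiple hits, which is where the proper methods come in.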


>And I would appreciate if you explain more about diversity data and divergence data or refer me to >somewhere I can read about it.
My interpretation is that the first (diversity data) covers the variations of gene x and its associated products within a single species, i.e. the variations you may witness in something like p53 (the cell terminator protein) in humans alone. Divergence data, however, would take into consideration the differences between mouse, human and, say, rat (see my masters thesis), again in p53 (we didn't see any mutational hotspots using our method; this was expected).
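
To make the distinction concrete, here is a rough sketch, assuming aligned sequences and using plain proportions of differing sites. The function names and the toy sequences are hypothetical; real analyses would use proper estimators and correct for multiple hits.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Proportion of aligned, ungapped sites that differ between two sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if "-" not in (x, y)]
    return sum(x != y for x, y in pairs) / len(pairs)


def diversity(seqs):
    """Mean pairwise difference among sequences sampled from ONE species."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def divergence(seqs_a, seqs_b):
    """Mean pairwise difference BETWEEN sequences of two species (uncorrected)."""
    pairs = list(product(seqs_a, seqs_b))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: three alleles from one species, two from another.
species_1 = ["ATGGCCAAAT", "ATGGCCAAAT", "ATGGCTAAAT"]
species_2 = ["ATGACCAAGT", "ATGACCAAGT"]

print(round(diversity(species_1), 3))              # → 0.067
print(round(divergence(species_1, species_2), 3))  # → 0.233
```

Note that the between-species number here still contains some within-species polymorphism (the third allele of the first species), which is one reason the two kinds of data are usually kept apart.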

-DPK-

QUOTE (zaanaa @ Apr 19 2006, 07:58 AM)
I do have a hypothesis. But I want to check the rates and substitutions between parts of sequences together, not the whole sequences (for instance, a part of a domain with the similar one in other species). It has been in my mind that simple counting can not be the way and the matter should be more complicated and tricky, as everyone said. I am still trying to study the methods more thoroughly to see which one better accords with the hypothesis and the data I have available. And I would appreciate if you explain more about diversity data and divergence data or refer me to somewhere I can read about it.

OK. Let's see...
Let me explain the diversity/divergence thing first.

When you have sequence data from several individuals of the same species, the polymorphisms you see when comparing those sequences are referred to as diversity within species. Polymorphisms are mutations that were able to increase in frequency within a species population at, say, time T after speciation, and that happened in the genetic background specific to species A, which was uniform until then...
[edit: I am sorry, it is impossible to draw trees here; they are just not displayed right.]
                       ________ Species A-allele1
   _____ Species A ___|
  |                   |________ Species A-allele2
 _|
  |____________________________ Species B


Something like this. So when you are comparing species A with species B, every substitution you see is lineage specific, and you don't know anything about the level of diversity within these species' populations.
Hence divergence-based methods compare the changes that are specific to speciation events, while diversity-based methods compare changes within a species.
This is important because changes that occur within a species can recombine with each other, and recombination, unless accounted for, would violate implicit assumptions of population genetic inferences about, say, the time to the most recent common ancestor, the mutation rate, the effective population size, etc.
And within species you can look at changes at the nucleotide level (transitions vs. transversions) or at the codon level (synonymous vs. non-synonymous polymorphism), while between species it is preferred to look only at changes at the codon and protein level.
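
Just to show what these two levels mean in practice, here is a naive sketch that classifies differences between two aligned, in-frame coding sequences. It assumes uppercase DNA, the standard genetic code and no gaps; the function names and toy sequences are made up. It is exactly the kind of "simple counting" that is not enough on its own: it does not normalise by the number of synonymous and non-synonymous sites, and codons with more than one change are only flagged, not resolved.

```python
# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PURINES = {"A", "G"}


def classify_site(x, y):
    """Two different bases: purine<->purine or pyrimidine<->pyrimidine is a transition."""
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


def classify_codons(seq_a, seq_b):
    """Tally nucleotide and codon level differences between two aligned CDSs."""
    counts = {"transition": 0, "transversion": 0,
              "synonymous": 0, "nonsynonymous": 0, "multiple": 0}
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        # Skip identical codons and anything that is not a plain ACGT codon.
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        diffs = [(x, y) for x, y in zip(ca, cb) if x != y]
        for x, y in diffs:
            counts[classify_site(x, y)] += 1
        if len(diffs) > 1:
            counts["multiple"] += 1   # order of changes unknown, not resolved here
        elif CODON_TABLE[ca] == CODON_TABLE[cb]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts


# Hypothetical toy sequences: ATG GCC AAA TTT GGG vs ATG GCT AAT TTC GAA
print(classify_codons("ATGGCCAAATTTGGG", "ATGGCTAATTTCGAA"))
# → {'transition': 4, 'transversion': 1, 'synonymous': 2, 'nonsynonymous': 1, 'multiple': 1}
```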
There are several models to define neutral substitution rates. They are generally referred to by three-letter acronyms, which you will eventually need to figure out for yourself, like PBL (Pamilo-Bianchi-Li) or GTR (general time-reversible), followed by a number, which generally refers to the year that the method was published...
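
As a taste of what such a model does, here is a small sketch of the simplest one I can think of, the Jukes-Cantor correction (JC69). It is not one of the models named above; I use it only because the formula is short. It assumes all substitutions are equally likely and converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3).

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site under JC69 from the observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# With 10% observed differences, the corrected distance is slightly larger,
# because some sites are assumed to have changed more than once.
print(round(jukes_cantor_distance(0.10), 4))  # → 0.1073
```

The gap between the observed and corrected values grows quickly as sequences diverge, which is why plain counting underestimates the amount of change between distant species.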
I don't really know what to refer you to. I guess the best place to start would be an intro to population genetics book: Hartl and Clark (although I personally don't like this one) or Hedrick would do the job for you. The book by Nei & Kumar on phylogenetics that I referred to before would be nice too.
When you build a better understanding of the concepts, you should read the PAML documentation, which you can get from here
Also, you might want to check some of the ancient papers, like the ones that first described famous genes such as ADH, HLA, MHC, G6PD etc. as being under selection.
Let us know if you need specific citations, but you can probably get a better idea if you search Google Scholar...
Good luck

PS. Oh, and also check out the neutral theory (Kimura 1968) and the nearly neutral theory (Ohta 1973).

-L_Han-
