General CpG island questions - (Jun/29/2005)
I have a few general questions about CpG islands.
I have heard two differing definitions of a CpG island: one from Gardiner-Garden and Frommer, which I understand to be the older, outdated definition, and one from Takai and Jones, which is understood to be the new "gold standard". What I don't quite understand is what the O/E (observed over expected) ratio actually measures. If the expected CpG ratio is 50% or 0.5 and the observed is slightly higher, shouldn't the ratio be above 1.0? Am I completely missing something?
The other question: I know that approximately 80% of CpGs in vertebrate genomes are subject to methylation. But does anyone know approximately what percentage of vertebrate promoters are presumed to be methylated?
Silly questions, but if anyone could help my understanding of this, that would be really helpful.
These are not silly questions at all, and I am still trying to get my head around it myself.
Indeed there are two definitions of a CpG island, with some in the field saying the Takai definition is the gold standard, but as Pcrman has pointed out quite nicely, that is not always the case.
The definition by Takai reduces the likelihood of returning a CpG island that is in fact a repetitive element. The issue is that Gardiner-Garden's definition includes elements such as Alus, which can be GC-rich and are thought to be highly methylated.
I have seen recent papers (published in the past few months) still citing Gardiner-Garden's definition with a mention of Takai's; there is a good set of reviews in Biochemistry (Moscow). Furthermore, it seems that on the UCSC browser at least, Gardiner-Garden's definition still holds (just take a look at their CpG island track).
As for the observed-to-expected ratio, this is a measure of how many CpG dinucleotides a selected region contains relative to the number expected by chance. Assuming no biases (and equal base frequencies), the expected frequency of each dinucleotide is 0.0625 (1/16, as there are 16 possible dinucleotide combinations), so CG has an expected frequency of 0.0625. However, this is not the case in most vertebrates: at the genome scale we see CG at up to five-fold less than the expected frequency (this is true for humans). The exception is at CpG islands, where there are generally more CGs than expected.
The observed-to-expected ratio is the actual observed count of CG within the stretch divided by the expected count. compseq is a good program for working with all dinucleotide combinations.
Your estimate is correct: within a CpG island you routinely see values greater than 1, indicating a high concentration of CG dinucleotides.
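For anyone who wants to try the calculation on their own sequence, here is a minimal Python sketch. The function names are made up for illustration. The first function uses the flat 1/16 expectation described above; the second uses the composition-corrected form (CpG count x length / (C count x G count)) that, as far as I know, is the usual way the ratio is computed under the Gardiner-Garden and Frommer definition. Treat it as a sketch, not as what any particular program does internally.

```python
def cpg_obs_exp_uniform(seq: str) -> float:
    """O/E ratio against the flat expectation of 1/16 per dinucleotide."""
    seq = seq.upper()
    n_dinucleotides = len(seq) - 1
    if n_dinucleotides < 1:
        return 0.0
    expected = n_dinucleotides * 0.0625  # 1/16 of all dinucleotide positions
    return seq.count("CG") / expected


def cpg_obs_exp(seq: str) -> float:
    """O/E ratio corrected for base composition:
    (number of CpG * sequence length) / (number of C * number of G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    toy = "CGCGCGCG"  # hypothetical toy sequence, not real data
    print(cpg_obs_exp(toy))          # 4 CpGs * 8 bases / (4 C * 4 G)
    print(cpg_obs_exp_uniform(toy))  # 4 CpGs / (7 positions * 0.0625)
```

The two versions can differ a lot for GC-rich or GC-poor stretches, which is why it matters which "expected" a given paper or tool is using.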
All the best.
Thanks for clearing that up, Nick. Maybe you could help answer another round of questions. The screening method we are considering is melting curve analysis (mainly because our lab does a lot of qRT-PCR) of DNA before and after bisulfite modification. The idea is to use the difference in GC content of the product to determine the approximate percentage of methylated DNA (within the total population) in a particular amplified CpG island.
My questions are: Is there any particular reason why sequencing alone (which I understand to be semi-quantitative), pre- and post-modification, would be insufficient to answer this question? Is it even possible to do normal sequencing on bisulfite-treated DNA (mainly to check for complete conversion)? Or is it necessary to go the pyrosequencing route?
If pyrosequencing is necessary, can you recommend another way to check for complete bisulfite conversion? Would methylation-site-specific restriction enzymes do the trick?
This is an interesting spin on a technique, performing a melting curve analysis. I have not done such analyses before, but in order to measure methylation levels by melting curve I would say you will have to construct a standard curve from products of the same region with known methylation, which you can then use to measure your test samples. Post-bisulfite DNA will always have a lower melting temperature; is the method sensitive enough to detect a one-base difference between a C and a T?
Another thing: such an analysis will tell you whether your product is methylated or not, but it will not be able to tell you exactly where the methylation is, that is, whether it's at the Sp1 site within the region or at a CTCF binding site or whatever. That's the advantage of bisulfite sequencing: you can say that at one particular CpG site, X% of the clones sequenced were methylated.
Note that it's semi-quantitative because you are only sampling a population of molecules by selecting clones and then sequencing them. You can, however, sequence directly without cloning, and this will actually give you a quantitative measure of methylation at a particular site.
You can of course check whether your DNA is converted after the PCR step with restriction enzymes; details can be found here: Bisulfite treatment
Pyrosequencing is another way to go, and people are starting to use this along with mass spec to measure methylation. It gets more exciting by the day!
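To make the "X% of the clones" idea concrete, here is a rough Python sketch of how the per-site percentage could be tallied from cloned bisulfite reads. It assumes the clone sequences are already aligned to the unconverted reference, are the same length, and come from the same strand; the function names and the toy sequences are hypothetical, not real data.

```python
def cpg_positions(reference: str) -> list:
    """Return 0-based positions of the C in every CG of the reference."""
    ref = reference.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]


def percent_methylated(reference: str, clones: list) -> dict:
    """For each CpG, the percentage of clones that still read C (methylated).

    After bisulfite treatment and PCR, a methylated C stays C and an
    unmethylated C reads as T. Any other base at that position is ignored.
    """
    result = {}
    for pos in cpg_positions(reference):
        calls = [clone[pos].upper() for clone in clones if len(clone) > pos]
        methylated = calls.count("C")
        unmethylated = calls.count("T")
        informative = methylated + unmethylated
        result[pos] = 100.0 * methylated / informative if informative else None
    return result


if __name__ == "__main__":
    reference = "ATCGTTACGA"  # hypothetical region with CpGs at positions 2 and 7
    clones = ["ATCGTTATGA", "ATTGTTATGA", "ATCGTTACGA", "ATCGTTATGA"]
    print(percent_methylated(reference, clones))
```

With only a handful of clones each percentage moves in big steps (25% per clone with four clones), which is the sampling limitation mentioned above.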
You've been really helpful, Nick. We have already worked out most of our controls and the so-called "standard" that we are going to use: restriction-site-specific methylases applied to demethylated DNA (5-Aza), resulting in standards with 0, 3, 7, 10, 17, and 27 CpG sites manually methylated. From the literature, methylation at 1 site results in a shift of about 0.25-0.5 degrees in the melting temperature of the product. So hopefully, by comparing our unknowns to the standard curve, we will have some idea of the methylation status of the CpG island that we amplified.
One last question, on primer design. Bisulfite treatment converts unmethylated cytosines to uracils. But a couple of the primer design guides that I have seen mention that, to make it easier on myself, I should copy my sequence into a Word document and replace all non-CpG cytosines with a T. I don't understand why. Shouldn't they all be converted to Us? Or is there a further step after desulphonation which results in conversion from uracil to thymine? I think this is just a typo, but am I missing something?
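As a rough sanity check on those numbers, here is a small Python sketch of the melting temperature shifts one might expect for each standard. It assumes the per-site shift of 0.25-0.5 degrees simply adds up linearly with the number of methylated sites, which is an assumption for illustration only; the real standard curve would have to be measured. The function names are hypothetical.

```python
# Per-site shift range quoted above (degrees); linear additivity is assumed.
SHIFT_PER_SITE = (0.25, 0.5)
STANDARD_SITES = [0, 3, 7, 10, 17, 27]  # methylated CpG sites in each standard


def expected_shift_range(n_sites, per_site=SHIFT_PER_SITE):
    """Lowest and highest expected Tm shift for n methylated sites."""
    low, high = per_site
    return (n_sites * low, n_sites * high)


def sites_from_shift(shift, per_site=SHIFT_PER_SITE):
    """Invert the relationship: range of site counts consistent with a shift."""
    low, high = per_site
    return (shift / high, shift / low)


if __name__ == "__main__":
    for n in STANDARD_SITES:
        lo, hi = expected_shift_range(n)
        print(f"{n:2d} sites: {lo:.2f} to {hi:.2f} degrees")
```

The wide ranges show why comparing against measured standards, rather than against the literature value alone, is the safer route.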
Unless you use MethPrimer, which does the conversion for you, you need to convert all non-CpG cytosines to thymine yourself.
Yes, you are correct to say that they actually convert to uracil. However, during PCR amplification the uracil in the template is read as thymine, so the amplified product contains thymine in its place (because you don't have dUTP in the reaction mixture!).
Hope this clears it up for you.
I would be interested to see how you go with the melting curve analysis. Please keep us informed!
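The find-and-replace step from the primer guides can also be done with a few lines of Python instead of a word processor. This is only a sketch of the idea: it models the strand as it reads after PCR, assumes every CpG cytosine is methylated (and so protected) unless told otherwise, and the function name is made up for illustration.

```python
def bisulfite_convert(seq: str, methylated_cpg: bool = True) -> str:
    """Return the sequence as it reads after bisulfite treatment and PCR.

    Every C outside a CpG becomes T. A C inside a CpG is kept only when
    methylated_cpg is True (methylated cytosines resist conversion).
    """
    seq = seq.upper()
    converted = []
    for i, base in enumerate(seq):
        in_cpg = base == "C" and i + 1 < len(seq) and seq[i + 1] == "G"
        if base == "C" and not (in_cpg and methylated_cpg):
            converted.append("T")  # unmethylated C -> U -> T after PCR
        else:
            converted.append(base)
    return "".join(converted)


if __name__ == "__main__":
    original = "ACCGTCAG"  # hypothetical toy sequence
    print(bisulfite_convert(original))                        # methylated template
    print(bisulfite_convert(original, methylated_cpg=False))  # unmethylated template
```

Running it both ways gives the two templates that methylation-specific primers would need to tell apart.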
OK, I guess I'm just confused about the whole idea of CpG islands, dinucleotides, methylation, deamination, etc. So:
CpG islands only refer to groups of CpGs that are unmethylated. These CpG islands are found mostly near the promoters of genes, especially housekeeping genes, and in a broader sense they are found in GC-rich isochores. Is this correct? So CpG islands are then rare outside genes, as in non-coding regions and in GC-poor regions?
CpG dinucleotides are found in so-called "islands", where they are hypomethylated, but otherwise, is CpG found in a methylated state? I guess I don't really understand where, when, and why CpG dinucleotides are methylated or not. So outside these islands CpG is mostly methylated and prone to deamination and C-to-T transitions? Is that why CpGs are underrepresented in the unmethylated state (and outside the CpG islands)?
Does deamination occur throughout the genome, or is it biased towards particular regions? E.g. are GC-rich isochores less exposed to deamination? Fryxell and Moon (2005) argue that 5mC deamination occurs less often in GC-rich regions.
What is driving methylation?
Is cytosine methylated more often in GC-poor than in GC-rich regions? And why?
Why are CpG islands unmethylated, and what are the ideas on what created and maintains these unmethylated regions? Why are they associated with GC-rich isochores? Does GC content somehow prevent methylation?
Any help would be very welcome!
I will try to answer some of your questions.
>>CpG islands only refer to groups of CpGs that are unmethylated.
A: No, CpG islands (CGIs) may or may not be methylated.
>>These CpG islands are found mostly near the promoters of genes, especially housekeeping genes, and in a broader sense they are found in GC-rich isochores. Is this correct? So CpG islands are then rare outside genes, as in non-coding regions and in GC-poor regions?
A: CGIs can be anywhere. Please see the following table from the Jones group:
Table 1. Number of CpG islands in chromosomes 21 and 22*
Category      Chr 21   Chr 22   Chr 21 + 22
5' region         57      138           195
Exon             334      423           757
Alu repeats    2,520    5,131         7,651
Unknown        2,128    3,331         5,459
Total          5,039    9,023        14,062
[*CpG islands were categorized into four categories in this order: "5' region" included at least the first coding exon of a known gene and might or might not include downstream introns, exons and Alus. An "Exon" CpG island did not include a known first coding exon and possibly included intronic and Alu sequences. An "Alu" did not include a known exonic sequence. "Unknown" sequences did not satisfy any of the above criteria. ]
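To see the proportions behind "CGIs can be anywhere", the counts in Table 1 can be turned into percentages with a short Python sketch. The numbers are copied from the table; the variable and function names are just for illustration.

```python
# Counts from Table 1 as (chromosome 21, chromosome 22).
COUNTS = {
    "5' region": (57, 138),
    "Exon": (334, 423),
    "Alu repeats": (2520, 5131),
    "Unknown": (2128, 3331),
}


def category_percentages(counts):
    """Share of all CpG islands (both chromosomes combined) per category."""
    combined = {name: chr21 + chr22 for name, (chr21, chr22) in counts.items()}
    total = sum(combined.values())
    return {name: round(100.0 * n / total, 1) for name, n in combined.items()}


if __name__ == "__main__":
    for name, pct in category_percentages(COUNTS).items():
        print(f"{name}: {pct}%")
```

On these counts, Alu-associated islands make up a little over half of the total, while islands covering a 5' region are well under 2%.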
>>CpG dinucleotides are found in so-called "islands", where they are hypomethylated, but otherwise, is CpG found in a methylated state? I guess I don't really understand where, when, and why CpG dinucleotides are methylated or not. So outside these islands CpG is mostly methylated and prone to deamination and C-to-T transitions? Is that why CpGs are underrepresented in the unmethylated state (and outside the CpG islands)?
A: Any CG dinucleotide is called a CpG site. Yes, they are usually methylated outside CGIs and usually not methylated within CGIs.
Cytosine is prone to deamination, which yields uracil; this can be efficiently removed by uracil-DNA glycosylase (base excision repair). In contrast, deamination of 5mC yields thymine. Not only does this reaction occur more efficiently than the deamination of cytosine to uracil, but it also yields a mismatch (G/T) that is repaired much less efficiently. Thus, 5-methylcytosine is mutagenic and, consequently, the dinucleotide CpG occurs at a considerably lower frequency in mammalian genomes than would be expected from simple combinatorial calculations based on GC content. The CpGs that remain are typically concentrated within GC-rich areas termed CGIs.
>>What is driving methylation?
A: I don't know yet.
Thanks for your quick reply.
So, CpG islands can be methylated. That's interesting, as several papers I found basically define CpG islands as being unmethylated (Sved and Bird, 1990; Antequera and Bird, 1993; Bird, 2002; Miyamoto and Freire, 2000). So, do these methylated islands occur mostly in silenced genes? And I suppose they are much less common than unmethylated ones, and all CpG islands associated with active genes are unmethylated?
As for their location: according to Jones, CGIs are mostly found near genes or associated with Alu repeats, but also in neither, so yes, I guess they can be anywhere. But I wonder, is their methylation state still dependent on their location? Is it that unmethylated islands are mostly found in GC-rich regions? GC-rich isochores are associated with Alus, high gene density, and some other characteristics, including unmethylated CGIs (Bernardi, 2000), and it is further said that these islands are "inherently resistant to methylation". Any idea how that resistance works?
I can see unmethylated CGIs being associated with active genes. I don't know much about their association with Alus, but I'm happy sticking to the Takai and Jones definition of CGIs, which excludes those associated with Alus. But I don't know how unmethylated CGIs are protected from methylation in GC-rich regions. Is it maybe similar to what protects CpG from deamination (DNA melting; Fryxell and Moon, 2005)?
Hi Hans, you have brought up some very valid questions.
Methylated islands do occur at silenced genes. The best examples are tumor suppressor genes in cancer, imprinted regions, and of course genes on the X chromosome that are subject to X-inactivation.
There are hypotheses that a demethylase enzyme actively maintains CpG islands in a hypomethylated state, but no one has characterised such an enzyme yet. There are also insulator proteins such as CTCF and BORIS that bind to hypomethylated GC-rich regions and are thought to sterically inhibit the methyltransferases from methylating the region.
I think that the transcription complex at actively transcribed CpG island-associated genes contains proteins that block DNMT localisation and thus inhibit methylation.
The underlying question you have asked will be answered by the epigenome project found here, where every CpG site will be assayed for methylation.
At the moment, the models presented by numerous groups are based on the best techniques available for global methylation studies, which employ methylation-sensitive restriction enzymes. It is now becoming clear that MBD proteins are able to bind a single methylated CpG that does not necessarily have to reside in a CpG island. As MBDs are known to be associated with chromatin remodelling proteins and serve to maintain a certain chromatin state, I feel that methylation certainly plays a role in chromatin maintenance, with "side effects" of regulating gene expression.
Could you forward me the Fryxell and Moon citation? I am unable to find it through a PubMed search.