Does a gene have an associated CpG island? - (Oct/16/2007)
I have a big list of genes (a few hundred) and would like to know how many of them are associated with CpG islands. Does anyone know of a database or something similar where I can easily determine whether a CpG island is near a gene? It would take forever to look them up individually!
The UCSC Genome Browser is your friend.
You can create a custom table with your list of genes, and there is already a CpG island track in the browser, and Bob's your uncle.
Saves you sitting there for hours looking at each one.
Have you got a list of genes? I can do it for you!
Thanks Nick, I thought there would be something for this. However, I've had a look but can't figure out how to use it.
Could you give me some more specific details on which links to follow to find where I put my gene list?
You will need a text editor and Excel, which will help.
Go to the Tables link, found in the middle near the top, and there you can select tracks. All annotations within the genome browser are in tracks, so there is one for RefSeq genes, one for mRNA, one for repeats and so forth.
So you need to select RefSeq Genes or Known Genes, and then you can create a table intersection using the intersection line, where you select another track and can ask: what in CpG Islands overlaps with RefSeq Genes? You can also ask what does not, and so forth.
Then your output. The summary button will tell you how many RefSeq genes overlap with CpG islands. You can also output to BED format, which is a tab-delimited file that can be opened in Excel or a text editor for sorting.
The BED file contains the coordinates and name of the gene, as well as other information.
This is a start; it's a good thing to learn the genome browser.
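For anyone wondering what the intersection is actually doing, here is a minimal Python sketch of the same idea done locally: a gene counts as "associated" if its interval shares at least one base with a CpG island on the same chromosome. The function names and the coordinates are made up for illustration, and this is not how the Table Browser is implemented; it only assumes the half-open start/end convention that BED files use.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def genes_with_islands(genes, islands):
    """Return names of genes that overlap at least one CpG island.

    genes:   iterable of (chrom, start, end, name)
    islands: iterable of (chrom, start, end)
    """
    # Group islands by chromosome so each gene is only compared
    # against islands on its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in islands:
        by_chrom[chrom].append((start, end))

    hits = []
    for chrom, start, end, name in genes:
        if any(overlaps(start, end, s, e) for s, e in by_chrom[chrom]):
            hits.append(name)
    return hits


# Made-up coordinates, for illustration only.
genes = [
    ("chr1", 1000, 5000, "geneA"),
    ("chr1", 9000, 12000, "geneB"),
    ("chr2", 1000, 5000, "geneC"),
]
islands = [("chr1", 900, 1200), ("chr2", 6000, 6500)]

print(genes_with_islands(genes, islands))
```

Note that this only catches islands that overlap the gene itself. If "near the gene" should include upstream promoter islands, you would have to widen each gene interval by some distance of your choosing before comparing.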
Ok, I'm getting closer but haven't quite got it right...
In the table browser I have the following details:
clade:Vertebrate, genome:Human, assembly:Mar 06
group:Genes and Gene Prediction Tracks, track:RefSeq Genes
Identifiers: (pasted my list of accession numbers in here)
Intersection with cpgIslandExt
Output file type: BED
I hit 'get output' and 'get BED' and the result is "No results returned from query".
I am assuming I've put something in the wrong box somewhere??
davo, at the moment you have probably told the table browser to look within a certain coordinate range... There is an option to tick "genome"; it's in between track and intersection. I forgot to tell you to check that box to look at the whole genome.
You can get tricky and ask the browser to look chromosome by chromosome with this line.
Nope, I've got the check-box next to genome ticked.
When I upload my file with the accession numbers, it does give me an error reading, "Note: some of the identifiers (e.g. NM_138705.2) have no match in table refGene, field name or in alias table refLink, field name. Try the "describe table schema" button for more information about the table and field." Not sure if this is the problem though.
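One thing worth checking here: the example in the error message, NM_138705.2, carries a version suffix (".2"). As far as I know the name field of refGene holds the bare accession without the version, so that could be why some identifiers don't match, though I can't say it is the whole problem. A small Python sketch for cleaning an ID list before pasting it in (the function names are just for illustration):

```python
def strip_version(accession):
    """Drop a trailing '.N' version suffix, e.g. 'NM_138705.2' -> 'NM_138705'."""
    accession = accession.strip()
    base, dot, version = accession.rpartition(".")
    # Only strip when there really is a dot followed by digits.
    return base if dot and version.isdigit() else accession


def clean_id_list(lines):
    """Strip versions, skip blank lines, and drop duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for line in lines:
        if not line.strip():
            continue
        acc = strip_version(line)
        if acc not in seen:
            seen.add(acc)
            cleaned.append(acc)
    return cleaned


print(clean_id_list(["NM_138705.2\n", "", "NM_172196", "NM_172196.1"]))
```

Saving the result as a plain .txt file, one accession per line, also avoids any hidden formatting a word processor might add.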
Hmmm, unless the RefSeq track doesn't contain the PubMed ID... but it should.
How are you uploading your IDs, from a Word doc file or from a txt file?
There should be stuff on the website that would explain it. Otherwise, get in touch with the UCSC guys; they are quick to respond and are very helpful. Hiram Clawson helped me a lot with custom track annotations.
Ok so I emailed UCSC and I got a reply. Nick you are right, they are only too willing to help!
I got some detailed instructions on what to do and now have some answers. From what I can make out there is a glitch in the system there somewhere which was causing me grief. However, there was a way around it: "make a Custom Track of the cpgIslandExt table, and intersect the refGene table with that".
I've now got my output file, but don't fully understand it. Below is a sample of the results.
chr2 48698451 48813788 NM_172196 0 + 48698517 48813568 0 9 87,102,124,56,85,590,261,90,253, 0,3042,3358,24593,27224,28644,51913,53782,115084,
I can see the first column is the chromosome number, the 2nd and 3rd match closely to the 7th and 8th respectively, the 4th column is the accession number, the 5th... no idea, the 6th is the DNA strand, and as for the remaining numbers I have no idea.
I assume these refer to the number of CpGs within an island? How do I interpret how many / if there is an associated CpG island from these numbers?
Sorry for so many questions, but all these bioinformatics sites never seem to be easy for the novice user.
The output is BED, and there is a help file that describes what the columns represent.
I think your output is of the gene that overlaps with a CpG island (because the fourth column is the name column). If you intersect the other way round, the output would be the CpG island.
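To make the columns concrete, here is a short Python sketch that parses the sample line above using the twelve standard BED field names (chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts). For a gene track, thickStart/thickEnd mark the coding region and the blocks are the exons, so none of these numbers count CpGs; a gene simply appears in the file if it overlapped an island. The helper names parse_bed12 and exons are my own, not part of any UCSC tool.

```python
BED12_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount",
    "blockSizes", "blockStarts",
]


def parse_bed12(line):
    """Parse one 12-column BED line into a dict with typed values."""
    parts = line.split()
    if len(parts) != 12:
        raise ValueError(f"expected 12 columns, got {len(parts)}")
    rec = dict(zip(BED12_FIELDS, parts))
    for key in ("chromStart", "chromEnd", "score",
                "thickStart", "thickEnd", "blockCount"):
        rec[key] = int(rec[key])
    # The block lists are comma-separated and end with a trailing comma.
    for key in ("blockSizes", "blockStarts"):
        rec[key] = [int(x) for x in rec[key].split(",") if x]
    return rec


def exons(rec):
    """Convert block offsets (relative to chromStart) to genomic coordinates."""
    return [
        (rec["chromStart"] + start, rec["chromStart"] + start + size)
        for start, size in zip(rec["blockStarts"], rec["blockSizes"])
    ]


line = ("chr2 48698451 48813788 NM_172196 0 + 48698517 48813568 0 9 "
        "87,102,124,56,85,590,261,90,253, "
        "0,3042,3358,24593,27224,28644,51913,53782,115084,")

rec = parse_bed12(line)
print(rec["name"], rec["strand"], rec["blockCount"], "exons")
for start, end in exons(rec):
    print(rec["chrom"], start, end)
```

So for this line: the gene spans chr2:48698451-48813788 on the + strand, the coding region runs from 48698517 to 48813568 (which is why columns 7 and 8 are close to columns 2 and 3), and there are 9 exons whose sizes and start offsets are the two comma-separated lists.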