transcription factor binding site prediction, identification - (Mar/05/2014 )
I am a first-year PhD student working in cellular immunology. I realized that my project needs some bioinformatic data to design the experiments well.
I have one target gene (RUNX3) and two transcription factors of interest (KLF4 and PU.1). I can find RUNX3 on chromosome 1 and retrieve both its sequence and the promoter sequence. I would like to find out how many potential binding sites there are in the two promoters (P1 and P2) of RUNX3. A published paper reports that two promoters exist for this gene, but one of the databases I checked for promoters (prediction on a given locus) gave me three. Which should I follow?
My aim is to check the transcription factor binding sites in the promoter regions of the gene. I found some core binding sequence motifs within the CDS. Are these artifacts, or can they be active binding sites (like duons or something similar)? How can I check this in silico?
If I am right, enhancer elements can also contain potential binding sites. How can I predict or determine the length and position of the enhancer regions of this gene (RUNX3) upstream of the promoter region and downstream of the coding sequence? How much extra DNA should I include before and after the gene to cover the enhancer regions?
I have some data about one of the promoters (ChIP-seq) and primers. Is there any software that can give me the DNA sequence lying between the primers, or do I have to do it manually?
To summarize my questions:
- How can I validate the predicted promoters of a gene in silico?
- I don't have access to TRANSFAC. Which software would you recommend for predicting and identifying transcription factor binding sites in a given region of DNA?
- How can I validate transcription factor binding sites within the CDS region in silico?
- How can I predict the enhancer regions of a gene (upstream and downstream)? Or, more simply, how many extra base pairs should I add upstream and downstream of the gene when I retrieve the DNA, to be sure I cover the enhancer regions when identifying potential binding sites?
Thank you very much!!!
Ok, i'll have a go at answering what i can...
so, you need the sequence of the promoter of RUNX3. to search for the promoter region, go to NCBI and search the nucleotide database using the name of your gene of interest plus "promoter" or "5' flanking". this will give you a lot of sequences, and you need to go through them to see which one is appropriate for you. the promoter sequences in GenBank are submitted by researchers and are usually characterized.
or, you could go to ensembl genome browser... for runx3: http://www.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000020633;r=1:25226002-25291612
here you should be able to see the different introns and exons.
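If you want the browser view to include some flanking DNA on either side of the gene, the region in that link can simply be widened. A minimal sketch in Python, assuming the `chrom:start-end` format from the URL above. The function names and the 10,000 bp flank are made up for illustration; the flank is not a recommended enhancer distance.

```python
import re

def pad_region(region, flank):
    """Widen a 'chrom:start-end' region by `flank` bp on each side."""
    match = re.fullmatch(r"(\w+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    # Coordinates are 1-based, so never go below position 1.
    # The end is NOT clamped to the chromosome length (unknown here).
    return f"{chrom}:{max(1, start - flank)}-{end + flank}"

def ensembl_view_url(gene_id, region):
    """Rebuild the Location/View link in the same form as the one above."""
    return ("http://www.ensembl.org/Homo_sapiens/Location/View"
            f"?db=core;g={gene_id};r={region}")

# Hypothetical flank size, chosen only to show the arithmetic.
wider = pad_region("1:25226002-25291612", 10_000)
print(wider)
print(ensembl_view_url("ENSG00000020633", wider))
```

How far to extend is a biological question the code cannot answer; it only saves retyping coordinates.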
Or my favourite, GeneCards, and look up the promoter sequence from there. http://www.genecards.org/cgi-bin/carddisp.pl?gene=RUNX3
it's possible for genes to have multiple promoter sites... especially if they are alternatively spliced. i always thought runx3 had two promoters, see the Groner Lab for info on this. if you have identified another one, you should discuss it with your supervisor.
once you have the sequence, you need to have a general idea of the transcription factor binding sites. this means a literature search... for PU.1 the putative binding site is GAGGAA. for KLF4 it is RCRCCYY or CACC (in IUPAC code, R = A or G and Y = C or T).
i then used to search manually for the binding site sequence in my gene of interest, so i have no idea which software to use. manually = with your eyes. (for a couple of genes this is ok... for all of them it's a problem... but you're only looking at one.)
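The by-eye search can also be done with a few lines of Python. This is only a sketch, not a replacement for a proper motif tool: it expands the IUPAC letters into a regular expression and looks for exact matches on both strands. The function names and the short test sequence are invented for the example.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Reverse complement of a DNA string, IUPAC codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_sites(seq, motif):
    """Return (1-based start, strand, matched text) for each exact match.

    Minus-strand sites are found by searching the forward sequence for
    the reverse complement of the motif, so positions and matched text
    always refer to the forward strand. A palindromic motif is
    therefore reported once per strand.
    """
    seq = seq.upper()
    hits = []
    for strand, m in (("+", motif.upper()), ("-", reverse_complement(motif))):
        # Lookahead so that overlapping matches are all reported.
        pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in m) + "))")
        for match in pattern.finditer(seq):
            hits.append((match.start() + 1, strand, match.group(1)))
    return sorted(hits)

# Invented toy sequence, not real RUNX3 promoter DNA.
toy = "ACGAGGAAGTTTCCTCAT"
for motif in ("GAGGAA", "RCRCCYY", "CACC"):
    print(motif, find_sites(toy, motif))
```

Short motifs such as CACC will match very often by chance, so a hit from this scan is still only a "potential" site.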
also keep an eye open for partial binding sites etc, as nothing is set in stone.
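One way to look for partial sites is to allow a small number of mismatches against the consensus. A rough Python sketch, forward strand only; the function name, the default of one mismatch and the toy sequence are assumptions made for illustration, not an established threshold.

```python
# Bases allowed by each IUPAC code used in the motifs from this thread.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

def near_matches(seq, motif, max_mismatches=1):
    """Return (1-based start, window, mismatch count) for every window
    of the forward strand that differs from the motif at no more than
    `max_mismatches` positions."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        mismatches = sum(
            base not in IUPAC_SETS[code]
            for base, code in zip(window, motif)
        )
        if mismatches <= max_mismatches:
            hits.append((i + 1, window, mismatches))
    return hits

# Invented toy sequence containing GAGGTA, one base away from GAGGAA.
print(near_matches("TTGAGGTAGC", "GAGGAA", max_mismatches=1))
```

The shorter the motif, the less a mismatch allowance means: a 4 bp motif with one mismatch will hit almost everywhere.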
to validate... do a ChIP (wet lab). nothing you do in silico will validate actual binding. until you can show that the transcription factors are binding to the chromatin (via ChIP), all you have is a "potential" binding site.
the ChIP-seq data you have should be global - it should contain all the places where the transcription factor binds. from that, you should be able to see a pattern. this will include other binding site sequences, or preferential binding areas.
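On the primer question from the original post: pulling out the DNA between two primers does not have to be done by hand. Web tools such as UCSC In-Silico PCR and NCBI Primer-BLAST do this against a whole genome. If you already have the template sequence, a few lines of Python will do as a sketch. The function names and sequences below are invented, and the code assumes exact primer matches with the forward primer on the given strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def amplicon(template, forward, reverse):
    """Return the template stretch from the start of the forward primer
    to the end of the reverse primer site, or None if either primer
    has no exact match. Only the first occurrence of each is used."""
    template = template.upper()
    forward = forward.upper()
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the other strand, so on the given
    # strand its site reads as the primer's reverse complement.
    rev_site = reverse_complement(reverse)
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]

# Invented toy template and primers, far shorter than real ones.
template = "GGGATGCCATTTTTTGGCAAGCCC"
print(amplicon(template, forward="ATGCCA", reverse="CTTGCC"))
```

If the function returns None, try swapping the primers or searching the reverse complement of the template before concluding that the primers do not match.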
Thank you very much!!!