Software for promoter analysis of cis binding sites? - identifying cis-elements in promoter sequences (Jul/07/2005)

Pages: 1 2 Next

Hi, does anybody know of a free site for analysing and identifying/naming cis-acting binding sites in promoter regions? I would like to be able to determine the binding sequences and the proteins that might bind to them in the promoter regions; it would be really helpful.
I've also tried TFSEARCH, but I'm uncertain about it: I compared the same promoter region with another program that has to be purchased (Genomatix), and although there are some similarities, the results are generally different, and now I don't know which to believe.

-ajp-

Try this: Transcription Element Search System (TESS) - String Search Page

http://www.cbil.upenn.edu/cgi-bin/tess/tess?RQ=SEA-FR-QueryS

-pcrman-

Also try TRANSFAC MatInspector. I would expect TFSEARCH results to differ from the others, for these reasons:
1. The TFSEARCH algorithm is not statistics-based; it strictly favors the highest-scoring matrix match.
2. The TFSEARCH matrices are not up to date. I believe they come from TRANSFAC 3.2 or lower, whereas others such as TESS use a much newer release, probably 5.0 or 6.0. TRANSFAC releases 7.0 and later charge fees or require registration.

I don't think you can rely solely on any of these search algorithms if you really want to find something. One suggestion I always make: when searching for transcription factor binding sites, incorporate a random model to reduce the false positive rate, which is notoriously high in these types of analysis.

-cyberpostdoc-

Sorry cyberpostdoc, I'm fairly new at this and unsure what you mean by "incorporate a random model to reduce false positive rate". Would you be kind enough to explain a little further?

-ajp-

QUOTE (ajp @ Jul 7 2005, 11:41 PM)
Sorry cyberpostdoc, I'm fairly new at this and unsure what you mean by "incorporate a random model to reduce false positive rate". Would you be kind enough to explain a little further?


Sure. One of the biggest problems in searching for transcription factor binding sites (TFBS) is that there are too many false positives. There are essentially two search strategies: 1. string matching with regular expressions; 2. a position-specific matrix combined with a scoring function. Either way, there may be a lot of hits that are not really TFBS used in vivo. This is mostly related to the base composition, and the di-, tri-, tetra-... nucleotide composition, of the sequences you search. Thus it is crucial to use a random model to filter the results and look only at statistically significant hits. Otherwise, after running the programs, you might have too many putative binding sites to handle.
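
To make the two strategies concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration only: the GATA-like consensus, the toy count matrix, the pseudocount, the uniform background of 0.25 per base, the threshold and the example sequence are invented and do not come from TRANSFAC, TESS or TFSEARCH.

```python
import math
import re

# Strategy 1: string matching with a regular expression.
# (A/T)GATA(A/G) is an illustrative consensus; the lookahead allows overlaps.
CONSENSUS = re.compile(r"(?=([AT]GATA[AG]))")

def regex_hits(seq):
    """Return 0-based start positions of consensus matches."""
    return [m.start() for m in CONSENSUS.finditer(seq.upper())]

# Strategy 2: position-specific matrix plus a scoring function.
# Toy base counts for a 4-bp site (columns = positions), invented for this sketch.
COUNTS = {
    "A": [1, 18, 1, 17],
    "C": [1, 1, 1, 1],
    "G": [17, 1, 1, 1],
    "T": [1, 0, 17, 1],
}
BACKGROUND = 0.25   # assumed uniform base frequency
PSEUDOCOUNT = 1.0   # avoids log(0) for unseen bases

def log_odds_matrix(counts):
    """Convert counts to log2(observed frequency / background) per position."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * PSEUDOCOUNT
        matrix.append({
            b: math.log2(((counts[b][i] + PSEUDOCOUNT) / total) / BACKGROUND)
            for b in "ACGT"
        })
    return matrix

def pwm_hits(seq, matrix, threshold):
    """Slide the matrix along seq; report windows scoring at or above threshold."""
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(matrix[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits

PROMOTER = "CCTGATAAGGCTAGATAGCC"  # made-up sequence
MATRIX = log_odds_matrix(COUNTS)
print(regex_hits(PROMOTER))
print(pwm_hits(PROMOTER, MATRIX, threshold=5.0))
```

Note that both functions only scan the forward strand; a real search would normally also scan the reverse complement. The threshold is the knob that trades missed sites against false positives, which is why the choice of cutoff alone does not solve the problem.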

In short, random model filtering means running the software you used to search for TFBS not only on your target sequences but also on a set of random sequences. Some binding sites will hit in both your target sequences and the random sequences, in which case you know they are probably not significant (for example, GATA-type sites might hit everywhere). Other hits, which are enriched only in your target sequences, may be what you want to look into further.

Now, there are essentially two types of random model. One is to generate random sequences de novo. This is usually done by a computer program that randomly picks A, T, G and C and concatenates them into a sequence of the same length as the target. In this case you may want to keep the ATGC composition the same as in your target sequence, for example by simply shuffling its bases. You can do this assuming no dependencies among nucleotides, which is called a zero-order Markov chain. You can also be more sophisticated and keep the di-, tri- and tetranucleotide composition comparable between target and control (random) sequences, which corresponds to 1st-, 2nd- and 3rd-order Markov chains.

The other type of random model is simpler. Say your target set has 200 genes in which you want to find a set of enriched TFBS that might be dictating the co-expression pattern. You simply pull out their promoter regions and search for TFBS. Now you can randomly pick another 200 genes from the genome and do the same thing, and again you look for hits that are enriched only in your target set, while filtering out hits found in both the random and the target sequences.
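
A rough Python sketch of the de novo controls described above, using only the standard library. The function names and the example sequence are hypothetical. The first function is the zero-order case (plain shuffling, exact base counts preserved); the second draws each base conditional on the previous one, i.e. a first-order Markov chain, which keeps dinucleotide composition approximately, not exactly, the same as the input.

```python
import random
from collections import Counter, defaultdict

def shuffle_sequence(seq, rng):
    """Zero-order control: same length and exact base counts, order randomized."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def markov1_sequence(seq, rng):
    """First-order control: sample each base given the base before it,
    using transition counts observed in the input sequence."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        transitions[prev][nxt] += 1
    overall = Counter(seq)
    out = [rng.choice(seq)]  # start from a base drawn by overall frequency
    while len(out) < len(seq):
        # Fall back to overall composition if this base never had a successor.
        options = transitions[out[-1]] or overall
        letters = list(options)
        weights = [options[b] for b in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)

rng = random.Random(0)                 # fixed seed so the run is repeatable
target = "ATATATATGCGCGCGCATATATAT"    # made-up sequence with strong dinucleotide bias
control0 = shuffle_sequence(target, rng)
control1 = markov1_sequence(target, rng)
print(control0)
print(control1)
```

On a sequence like this one, the shuffled control breaks up the AT and GC runs while the first-order control tends to keep them, which is the practical difference between the two models.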

The statistical tests most commonly applied are chi-squared tests and t-tests.
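
As an illustration of the chi-squared option, here is a small sketch for the 200-versus-200 gene comparison. It assumes you have counted, for one binding site, how many promoters in each set contain at least one hit; the counts in the example (60 of 200 target promoters, 30 of 200 random ones) are invented. The function name is hypothetical, and the p-value uses the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal.

```python
import math

def chi_squared_2x2(target_with, target_total, control_with, control_total):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction)
    comparing the fraction of promoters with a hit in two gene sets."""
    a, b = target_with, target_total - target_with
    c, d = control_with, control_total - control_with
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero; the test is undefined")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-squared with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = chi_squared_2x2(60, 200, 30, 200)
print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

If you test many binding-site matrices at once, the p-values need a multiple-testing correction, and for small counts an exact test is usually a safer choice than the chi-squared approximation.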

-cyberpostdoc-

TF binding sites are sometimes quite short. Throw enough random sequence at a program and it will get "hits". Say you want to compare the upstream sequences of 20 genes, so you grab 2000 bp upstream of each one. That's 40 kb of sequence, in which you will find "hits" purely by chance. There are also a lot of ubiquitous TF sequences that may not mean a whole lot biologically, since they are found in a lot of genes.
Some TF analysis programs take this into account and will compare your hits to randomly generated sequences and to known promoter frequencies, to try to weed out random-sequence hits and ubiquitous or non-significant matches.
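
A back-of-the-envelope version of that 20 x 2000 bp example in Python. It assumes all four bases are equally likely and independent, an exact (non-degenerate, non-palindromic) motif, and a search on both strands; real promoters violate all of these, so treat the numbers as orders of magnitude only. The function name is hypothetical.

```python
def expected_chance_hits(n_sequences, seq_length, motif_length, both_strands=True):
    """Expected number of exact matches to one fixed motif in random sequence."""
    positions = n_sequences * (seq_length - motif_length + 1)
    strands = 2 if both_strands else 1
    return positions * strands / 4 ** motif_length

for k in (4, 6, 8, 10):
    print(k, round(expected_chance_hits(20, 2000, k), 2))
```

Under these assumptions a single exact 6-bp site is expected about 19 times in the 40 kb by chance alone, and degenerate matrices that tolerate mismatches will hit far more often.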

-cip-

If TESS gives a score for the hits and you choose just the high-scoring hits (e.g. >14), would doing the random sequence hits help? I suppose my question really is: how reliable is TESS? I'm having the same problem.

-ggUss-

QUOTE (ggUss @ Jul 9 2005, 04:10 AM)
If TESS gives a score for the hits and you choose just the high-scoring hits (e.g. >14), would doing the random sequence hits help? I suppose my question really is: how reliable is TESS? I'm having the same problem.


Bioinformatics predictions are very vulnerable to false positives; I don't think you can trust all of them. A random control should be done in most cases.

One thing is true: some hits might really match the consensus for a binding site, but who knows whether it gets used in vivo? When it really comes to biology, nature is the boss.

-cyberpostdoc-

What if my query input sequence is 5kb long? How do I design a random sequence of this length to include as a control in TESS? Thanks in advance.

-ggUss-

QUOTE (ggUss @ Jul 13 2005, 08:17 AM)
What if my query input sequence is 5kb long? How do I design a random sequence of this length to include as a control in TESS? Thanks in advance.


1. Create 100 random sequences of 5 kb that have the same ATGC composition as your input sequence;
2. search TESS for TFBS using these 100 random sequences and compile statistics of the hits;
3. compare the hit statistics from the random sequences with those from your input sequence to determine which hits might be interesting.
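
The three steps can be sketched in Python as below. Since TESS is a web service, `count_hits` here is only a hypothetical local stand-in (a regular-expression count) for whatever search you actually run; the pattern and the 5 kb query are made up as well. The result is an empirical p-value: the fraction of shuffled sequences with at least as many hits as the real one.

```python
import random
import re

def count_hits(seq, pattern):
    """Hypothetical stand-in for a TFBS search: count overlapping regex matches."""
    return len(re.findall(f"(?=(?:{pattern}))", seq))

def empirical_p_value(seq, pattern, n_random=100, seed=0):
    """Compare hits in seq against n_random shuffles with the same base composition."""
    rng = random.Random(seed)
    observed = count_hits(seq, pattern)
    at_least_as_many = 0
    for _ in range(n_random):
        bases = list(seq)
        rng.shuffle(bases)  # step 1: random sequence, same ATGC composition
        if count_hits("".join(bases), pattern) >= observed:  # step 2: search it
            at_least_as_many += 1
    # Step 3: add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_many + 1) / (n_random + 1)

# Made-up 5 kb query; replace with your real sequence.
rng = random.Random(1)
query = "".join(rng.choice("ACGT") for _ in range(5000))
observed, p = empirical_p_value(query, "[AT]GATA[AG]")
print(observed, p)
```

With 100 random sequences the smallest p-value you can obtain is 1/101, so use more shuffles if you need finer resolution, for instance when testing many matrices.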

-cyberpostdoc-