Blastclust or CD-HIT? - Which one can retrieve better results? (Jul/04/2006)

Hi there,

I am trying to cluster some sequences at 25% sequence similarity. I was told to use blastclust, but I have hit what seems to be a major problem with it.

When I try to execute blastclust, I use the following command:

blastclust -i sara.txt -o output.txt -p T -L .9 -b T -S 25

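-i = my input FASTA file, called "sara.txt"
-o = my output file, called "output.txt"
-p T = the input file contains protein sequences, not nucleotide sequences
-L .9 = the alignment must cover 90% of the length of the sequence
-b T = the length coverage must hold for both sequences in a pair
-S 25 = I want sequences which have 25% similarity (blastclust reads values of 3 or more as percent identity)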

OK, so this is what happens when blastclust is executed:

[blastclust] WARNING: Non-unique or not-sorted string IDs found 2879 line: 'q9y6d5' 2878 line: 'q9y6d5'
[blastclust] WARNING: Non-unique or not-sorted string IDs found 2883 line: 'q9z0r4' 2882 line: 'q9z0r4'
[blastclust] WARNING: Non-unique or not-sorted string IDs found 2884 line: 'q9z0r4' 2883 line: 'q9z0r4'
[blastclust] WARNING: Non-unique or not-sorted string IDs found 2885 line: 'q9z0r4' 2884 line: 'q9z0r4'
Jul 4, 2006 11:40 AM Start clustering of 1444 queries
[blastclust] FATAL ERROR: Blastclust cannot process input files with non-unique sequence identifiers

These identifiers are ones from Uniprot/Swissprot so I don't understand how they are non-unique.
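A quick way to check what blastclust is complaining about is to count how often each identifier occurs in the FASTA file. The sketch below is only an illustration in Python: it assumes the identifier is the first whitespace-delimited token after ">" on each header line, and the function names (fasta_ids, duplicate_ids) are made up for this example.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield the identifier of each FASTA header line.

    Assumption: the identifier is the first token after '>'.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicate_ids(lines):
    """Return {identifier: count} for identifiers that occur more than once."""
    counts = Counter(fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Tiny in-memory example; for a real file, pass an open file handle instead,
# e.g. duplicate_ids(open("sara.txt")).
example = [">Q9Z0R4 first copy", "MAAA", ">Q9Z0R4 second copy", "MAAA", ">Q10466", "MTTQ"]
for seq_id, n in sorted(duplicate_ids(example).items()):
    print(seq_id, n)
```

If this reports any identifiers, the same accession appears on more than one header line, which would be consistent with the warnings above.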

I've been trying to get this working but have had no luck, so I decided to use another algorithm called CD-HIT.

This algorithm does return results and is quite fast (it does the job in 2 minutes - yippee!), but I can only go as low as 40% similarity, whereas I need sequences clustered at 25% similarity (i.e. a set of non-homologous sequences).

These are the results which CD-HIT gives me:

>Cluster 0
0 25943aa, >Q10466 MTTQAPTFTQ... *
>Cluster 1
0 4613aa, >Q9NU22 MEHFLLEVAA... *
1 4613aa, >Q9NU22 MEHFLLEVAA... at 100%
>Cluster 2
0 4279aa, >O14686 MDSQNLAGED... *
>Cluster 3
0 4055aa, >Q9QYX7 MGNEASLEGE... *
1 3066aa, >Q9QYX7 MGNEASLEGE... at 100%
2 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
3 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
4 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
5 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
6 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
7 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
8 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
9 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
10 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
11 4055aa, >Q9QYX7 MGNEASLEGE... at 100%
>Cluster 4
0 3984aa, >Q92736 MADGGEGEDE... at 42%
1 4054aa, >P11716 MGDGGEGEDE... *
>Cluster 5
0 3701aa, >Q15149 MVAGMLMPRD... at 93%
1 3701aa, >Q15149 MVAGMLMPRD... at 93%
2 3703aa, >Q6S389 MVAGMLMPLD... *
3 3701aa, >Q15149 MVAGMLMPRD... at 93%
4 3701aa, >Q15149 MVAGMLMPRD... at 93%
5 3701aa, >Q15149 MVAGMLMPRD... at 93%
>Cluster 6
0 3391aa, >Q7Z6Z7 MKVDRTKLKK... *
1 3391aa, >Q7Z6Z7 MKVDRTKLKK... at 100%
2 3391aa, >Q7Z6Z7 MKVDRTKLKK... at 100%
>Cluster 7
0 3145aa, >P78527 MAGSGAGVRC... *
>Cluster 8
0 2986aa, >Q03164 MAHSCRWRFP... *
1 2986aa, >Q03164 MAHSCRWRFP... at 100%
2 2986aa, >Q03164 MAHSCRWRFP... at 100%
3 2986aa, >Q03164 MAHSCRWRFP... at 100%
4 2986aa, >Q03164 MAHSCRWRFP... at 100%
>Cluster 9
0 2958aa, >O88737 GNEASLEGGA... *
1 2958aa, >O88737 GNEASLEGGA... at 100%
2 2958aa, >O88737 GNEASLEGGA... at 100%
3 2958aa, >O88737 GNEASLEGGA... at 100%
4 2958aa, >O88737 GNEASLEGGA... at 100%
5 2958aa, >O88737 GNEASLEGGA... at 100%
6 2958aa, >O88737 GNEASLEGGA... at 100%
7 2958aa, >O88737 GNEASLEGGA... at 100%
8 2958aa, >O88737 GNEASLEGGA... at 100%
9 2958aa, >O88737 GNEASLEGGA... at 100%
10 2958aa, >O88737 GNEASLEGGA... at 100%
11 2958aa, >O88737 GNEASLEGGA... at 100%
12 2958aa, >O88737 GNEASLEGGA... at 100%
13 2958aa, >O88737 GNEASLEGGA... at 100%
14 2958aa, >O88737 GNEASLEGGA... at 100%
15 2958aa, >O88737 GNEASLEGGA... at 100%
16 2958aa, >O88737 GNEASLEGGA... at 100%
17 2958aa, >O88737 GNEASLEGGA... at 100%
18 2958aa, >O88737 GNEASLEGGA... at 100%
19 2958aa, >O88737 GNEASLEGGA... at 100%
20 2958aa, >O88737 GNEASLEGGA... at 100%
21 2958aa, >O88737 GNEASLEGGA... at 100%
22 2958aa, >O88737 GNEASLEGGA... at 100%
>Cluster 10
0 2941aa, >Q01484 MMNEDAAQKS... *
1 2941aa, >Q01484 MMNEDAAQKS... at 100%
2 2941aa, >Q01484 MMNEDAAQKS... at 100%
3 2941aa, >Q01484 MMNEDAAQKS... at 100%
4 2941aa, >Q01484 MMNEDAAQKS... at 100%
5 2941aa, >Q01484 MMNEDAAQKS... at 100%
6 236aa, >Q8C8R3 MASPTSPGPE... at 69%
7 236aa, >Q8C8R3 MASPTSPGPE... at 69%
8 236aa, >Q8C8R3 MASPTSPGPE... at 69%
>Cluster 11
0 2876aa, >Q9Y4A5 MAFVATQGAT... *
>Cluster 12
0 2435aa, >P51587 MPIGSKERPT... *
>Cluster 13
0 2273aa, >P46013 MWPTRRLVTI... *
1 2273aa, >P46013 MWPTRRLVTI... at 100%
2 2273aa, >P46013 MWPTRRLVTI... at 100%
>Cluster 14
0 2241aa, >P49792 MRRSKADVER... *
1 2241aa, >P49792 MRRSKADVER... at 100%
2 2241aa, >P49792 MRRSKADVER... at 100%
3 2241aa, >P49792 MRRSKADVER... at 100%
4 2241aa, >P49792 MRRSKADVER... at 100%
5 2241aa, >P49792 MRRSKADVER... at 100%
6 2241aa, >P49792 MRRSKADVER... at 100%
>Cluster 15
0 2073aa, >Q13315 MSLVLNDLLI... *
>Cluster 16
0 24aa, >Q13523 MAAAETQSLR... at 45%
1 24aa, >Q13523 MAAAETQSLR... at 45%
2 24aa, >Q13523 MAAAETQSLR... at 45%
3 24aa, >Q13523 MAAAETQSLR... at 45%
4 24aa, >Q13523 MAAAETQSLR... at 45%
5 24aa, >Q13523 MAAAETQSLR... at 45%
6 24aa, >Q13523 MAAAETQSLR... at 45%
7 24aa, >Q13523 MAAAETQSLR... at 45%
8 24aa, >Q13523 MAAAETQSLR... at 45%
9 24aa, >Q13523 MAAAETQSLR... at 45%
10 24aa, >Q13523 MAAAETQSLR... at 45%
11 24aa, >Q13523 MAAAETQSLR... at 45%
12 24aa, >Q13523 MAAAETQSLR... at 45%
13 24aa, >Q13523 MAAAETQSLR... at 45%
14 24aa, >Q13523 MAAAETQSLR... at 45%
15 24aa, >Q13523 MAAAETQSLR... at 45%
16 24aa, >Q61136 MAATEPPSLR... at 45%
17 2055aa, >O75962 MKAMDVLPIL... *

> = starts a new cluster
* = this sequence is a representative of this cluster
% = the identity between this sequence and the representative
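For working with this listing programmatically, here is a rough Python sketch that reads cluster lines in the layout shown above and pulls out the representative of each cluster. The regular expressions are written against the lines as they appear in this post, so treat them as an assumption about the format; parse_clusters and representatives are hypothetical helper names. Note that this only reads the output: it does not get around the 40% lower limit.

```python
import re

# Member line as shown in the post: "<index> <length>aa, ><identifier> ..."
MEMBER = re.compile(r"^\d+\s+(\d+)aa,\s+>(\S+)")
# Non-representative members end with "at <number>%"
IDENTITY = re.compile(r"([\d.]+)%\s*$")


def parse_clusters(lines):
    """Return {cluster_number: [member dicts]} from CD-HIT style cluster lines."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        match = MEMBER.match(line)
        if match is None or current is None:
            continue  # skip blank or unrecognised lines
        identity = IDENTITY.search(line)
        clusters[current].append({
            # strip trailing dots in case the identifier is written as "Q10466..."
            "id": match.group(2).rstrip("."),
            "length": int(match.group(1)),
            "representative": line.endswith("*"),
            "identity": float(identity.group(1)) if identity else None,
        })
    return clusters


def representatives(clusters):
    """Return the identifier of every member marked with '*'."""
    return [m["id"] for members in clusters.values() for m in members if m["representative"]]


sample = """>Cluster 4
0 3984aa, >Q92736 MADGGEGEDE... at 42%
1 4054aa, >P11716 MGDGGEGEDE... *
>Cluster 7
0 3145aa, >P78527 MAGSGAGVRC... *
""".splitlines()

print(representatives(parse_clusters(sample)))
```

Keeping one representative per cluster gives a set that is non-redundant at whatever threshold CD-HIT was run with.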

How can I get sequences with 25% homology using the results from CD-HIT? Or does anyone know how I can get blastclust working?

Would it be possible if I sent my FASTA file to anyone (a professional or bioinformatician) to try and get blastclust to work?

Any help or suggestions would be very much appreciated.

Thank you

Sara

-sara.pl-

Hello.

This can be a bit of a pain. I had exactly the same problem some time ago. Sadly, I can't remember how I fixed it! One thing you can try is to make all of the sequence names unique by writing a Perl script that renames the sequences to numbers 1..n; they should then be unique.
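As a rough illustration of the renaming idea (in Python here rather than Perl), the sketch below replaces each header with a running number and keeps a lookup table so the original names can be restored afterwards. The function name renumber_fasta and the choice to store the whole original header are assumptions made for this example only.

```python
def renumber_fasta(lines):
    """Rename FASTA headers to 1..n.

    Returns (renamed_lines, mapping), where mapping[number] holds the
    original header text (without the leading '>').
    """
    renamed = []
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            number = str(len(mapping) + 1)  # next unused number
            mapping[number] = line[1:]
            renamed.append(">" + number)
        else:
            renamed.append(line)  # sequence lines pass through unchanged
    return renamed, mapping


records = [">Q9Z0R4 example", "MAAA", ">Q9Z0R4 example", "MAAA"]
renamed, mapping = renumber_fasta(records)
print("\n".join(renamed))
print(mapping)
```

Writing the mapping out to a file alongside the renamed FASTA means the numbers in the clustering output can be translated back to the original identifiers later.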

The alternative is to use muscle - I am fairly certain you can use it to cluster sequences.

The error is not major, it makes perfect sense if you think about it.

Are you working with structures too? If so I have another idea.

If you can post the file here I will have a look. Alternatively, email it to <my username> at googlemail.com and put up a post here telling me; otherwise I probably won't check.

-perlmunky-

QUOTE (perlmunky @ Jul 4 2006, 03:49 PM)

Are you working with structures too? If so I have another idea.

if you can post the file here I will have a look.

Hi perlmunky,

I don't really want to rename the sequences, as I'll need the identifiers to link to the Gene Ontology database. Imagine trying to search Gene Ontology without a unique identifier... it'll be a nightmare!

Do you think CD-HIT is a waste of time? I'm kinda getting frustrated with clustering as I've spent the past 3 weeks trying to figure out blastclust. I'm not working with structures at the moment but I may in the next few weeks.
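One possible way to keep the original identifiers is to drop repeated records instead of renaming them. The Python sketch below keeps only the first record seen for each identifier. It assumes the identifier is the first token after ">", and drop_repeated_records is a hypothetical name. It also assumes that records sharing an identifier are interchangeable, which may not hold here: in the CD-HIT listing Q9QYX7 shows up at both 4055aa and 3066aa, so check before discarding anything.

```python
def drop_repeated_records(lines):
    """Keep the first FASTA record for each identifier and drop later ones.

    Assumption: the identifier is the first token after '>'.
    """
    seen = set()
    kept = []
    keep_current = False  # lines before the first header are ignored
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep_current = seq_id not in seen
            seen.add(seq_id)
        if keep_current:
            kept.append(line)
    return kept


records = [">Q9QYX7", "MGNEASLEGE", ">O14686", "MDSQNLAGED", ">Q9QYX7", "MGNEASLEGE"]
print("\n".join(drop_repeated_records(records)))
```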

I'm attaching my FASTA file with this post.

I really appreciate your help on this. Cheers!

Sara

P.S. I've emailed you a copy of the FASTA file.

-sara.pl-