BLASTCLUST OR CD-HIT? - Which one can retrieve better results? (Jul/04/2006)
Hi there,
I am trying to cluster some sequences at a threshold of 25% sequence similarity. I was told to use blastclust, but I have run into a major problem with it.
When I run blastclust, I use the following command:
blastclust -i sara.txt -o output.txt -p T -L .9 -b T -S 25
-i = my input fasta file called "sara.txt"
-o = my output file called "output.txt"
-p T = indicates that the input file contains protein sequences and not nucleotide sequences
-L .9 = the alignment must cover 90% of the length of the sequence
-b T = require that coverage on both sequences of a pair
-S 25 = I want sequences which have 25% similarity
OK, so this is what happens when blastclust is executed:
[blastclust] WARNING: Non-unique or not-sorted string IDs found 2879 line: 'q9y6d5' 2878 line: 'q9y6d5'
[blastclust] WARNING: Non-unique or not-sorted string IDs found 2883 line: 'q9z0r4' 2882 line: 'q9z0r4'
[blastclust] WARNING: Non-unique or not-sorted string IDs found 2884 line: 'q9z0r4' 2883 line: 'q9z0r4'
[blastclust] WARNING: Non-unique or not-sorted string IDs found 2885 line: 'q9z0r4' 2884 line: 'q9z0r4'
Jul 4, 2006 11:40 AM Start clustering of 1444 queries
[blastclust] FATAL ERROR: Blastclust cannot process input files with non-unique sequence identifiers
These identifiers come from UniProt/Swiss-Prot, so I don't understand how they can be non-unique.
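One way to see which identifiers blastclust is objecting to is to count how often each header ID occurs in the FASTA file. The following is only a rough sketch: it assumes the ID is the first whitespace-delimited token after ">", it compares IDs case-insensitively because the warnings print them in lowercase (an assumption about how blastclust compares them), and the function name find_duplicate_ids is made up for this example.

```python
from collections import Counter


def find_duplicate_ids(fasta_lines):
    """Return {id: count} for every FASTA header ID that occurs more than once."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the ID is the first token of the header line.
            # Lowercased because the blastclust warnings show lowercase IDs.
            seq_id = fields[0].lower() if fields else ""
            counts[seq_id] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Tiny made-up input; with a real file, pass the open file handle instead.
example = [
    ">Q9Y6D5 first copy", "MTTQAPTFTQ",
    ">Q9Z0R4", "MEHFLLEVAA",
    ">Q9Y6D5 second copy", "MTTQAPTFTQ",
]
duplicates = find_duplicate_ids(example)
print(duplicates)
```

If the same accession turns up in several records (for instance because the file was assembled from more than one query), each of those records counts as a separate sequence with the same ID.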
I've been trying to get this thing working with no luck, so I decided to use another algorithm called CD-HIT.
This algorithm does return results and is quite fast (it does the job in 2 minutes - yippee!), but I can only go as low as 40% similarity, when in fact I need sequences that have 25% homology (i.e. non-homologous sequences).
These are the results which CD-HIT gives me:
>Cluster 0
0	25943aa, >Q10466	MTTQAPTFTQ... *
>Cluster 1
0	4613aa, >Q9NU22	MEHFLLEVAA... *
1	4613aa, >Q9NU22	MEHFLLEVAA... at 100%
>Cluster 2
0	4279aa, >O14686	MDSQNLAGED... *
>Cluster 3
0	4055aa, >Q9QYX7	MGNEASLEGE... *
1	3066aa, >Q9QYX7	MGNEASLEGE... at 100%
2	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
3	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
4	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
5	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
6	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
7	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
8	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
9	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
10	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
11	4055aa, >Q9QYX7	MGNEASLEGE... at 100%
>Cluster 4
0	3984aa, >Q92736	MADGGEGEDE... at 42%
1	4054aa, >P11716	MGDGGEGEDE... *
>Cluster 5
0	3701aa, >Q15149	MVAGMLMPRD... at 93%
1	3701aa, >Q15149	MVAGMLMPRD... at 93%
2	3703aa, >Q6S389	MVAGMLMPLD... *
3	3701aa, >Q15149	MVAGMLMPRD... at 93%
4	3701aa, >Q15149	MVAGMLMPRD... at 93%
5	3701aa, >Q15149	MVAGMLMPRD... at 93%
>Cluster 6
0	3391aa, >Q7Z6Z7	MKVDRTKLKK... *
1	3391aa, >Q7Z6Z7	MKVDRTKLKK... at 100%
2	3391aa, >Q7Z6Z7	MKVDRTKLKK... at 100%
>Cluster 7
0	3145aa, >P78527	MAGSGAGVRC... *
>Cluster 8
0	2986aa, >Q03164	MAHSCRWRFP... *
1	2986aa, >Q03164	MAHSCRWRFP... at 100%
2	2986aa, >Q03164	MAHSCRWRFP... at 100%
3	2986aa, >Q03164	MAHSCRWRFP... at 100%
4	2986aa, >Q03164	MAHSCRWRFP... at 100%
>Cluster 9
0	2958aa, >O88737	GNEASLEGGA... *
1	2958aa, >O88737	GNEASLEGGA... at 100%
2	2958aa, >O88737	GNEASLEGGA... at 100%
3	2958aa, >O88737	GNEASLEGGA... at 100%
4	2958aa, >O88737	GNEASLEGGA... at 100%
5	2958aa, >O88737	GNEASLEGGA... at 100%
6	2958aa, >O88737	GNEASLEGGA... at 100%
7	2958aa, >O88737	GNEASLEGGA... at 100%
8	2958aa, >O88737	GNEASLEGGA... at 100%
9	2958aa, >O88737	GNEASLEGGA... at 100%
10	2958aa, >O88737	GNEASLEGGA... at 100%
11	2958aa, >O88737	GNEASLEGGA... at 100%
12	2958aa, >O88737	GNEASLEGGA... at 100%
13	2958aa, >O88737	GNEASLEGGA... at 100%
14	2958aa, >O88737	GNEASLEGGA... at 100%
15	2958aa, >O88737	GNEASLEGGA... at 100%
16	2958aa, >O88737	GNEASLEGGA... at 100%
17	2958aa, >O88737	GNEASLEGGA... at 100%
18	2958aa, >O88737	GNEASLEGGA... at 100%
19	2958aa, >O88737	GNEASLEGGA... at 100%
20	2958aa, >O88737	GNEASLEGGA... at 100%
21	2958aa, >O88737	GNEASLEGGA... at 100%
22	2958aa, >O88737	GNEASLEGGA... at 100%
>Cluster 10
0	2941aa, >Q01484	MMNEDAAQKS... *
1	2941aa, >Q01484	MMNEDAAQKS... at 100%
2	2941aa, >Q01484	MMNEDAAQKS... at 100%
3	2941aa, >Q01484	MMNEDAAQKS... at 100%
4	2941aa, >Q01484	MMNEDAAQKS... at 100%
5	2941aa, >Q01484	MMNEDAAQKS... at 100%
6	236aa, >Q8C8R3	MASPTSPGPE... at 69%
7	236aa, >Q8C8R3	MASPTSPGPE... at 69%
8	236aa, >Q8C8R3	MASPTSPGPE... at 69%
>Cluster 11
0	2876aa, >Q9Y4A5	MAFVATQGAT... *
>Cluster 12
0	2435aa, >P51587	MPIGSKERPT... *
>Cluster 13
0	2273aa, >P46013	MWPTRRLVTI... *
1	2273aa, >P46013	MWPTRRLVTI... at 100%
2	2273aa, >P46013	MWPTRRLVTI... at 100%
>Cluster 14
0	2241aa, >P49792	MRRSKADVER... *
1	2241aa, >P49792	MRRSKADVER... at 100%
2	2241aa, >P49792	MRRSKADVER... at 100%
3	2241aa, >P49792	MRRSKADVER... at 100%
4	2241aa, >P49792	MRRSKADVER... at 100%
5	2241aa, >P49792	MRRSKADVER... at 100%
6	2241aa, >P49792	MRRSKADVER... at 100%
>Cluster 15
0	2073aa, >Q13315	MSLVLNDLLI... *
>Cluster 16
0	24aa, >Q13523	MAAAETQSLR... at 45%
1	24aa, >Q13523	MAAAETQSLR... at 45%
2	24aa, >Q13523	MAAAETQSLR... at 45%
3	24aa, >Q13523	MAAAETQSLR... at 45%
4	24aa, >Q13523	MAAAETQSLR... at 45%
5	24aa, >Q13523	MAAAETQSLR... at 45%
6	24aa, >Q13523	MAAAETQSLR... at 45%
7	24aa, >Q13523	MAAAETQSLR... at 45%
8	24aa, >Q13523	MAAAETQSLR... at 45%
9	24aa, >Q13523	MAAAETQSLR... at 45%
10	24aa, >Q13523	MAAAETQSLR... at 45%
11	24aa, >Q13523	MAAAETQSLR... at 45%
12	24aa, >Q13523	MAAAETQSLR... at 45%
13	24aa, >Q13523	MAAAETQSLR... at 45%
14	24aa, >Q13523	MAAAETQSLR... at 45%
15	24aa, >Q13523	MAAAETQSLR... at 45%
16	24aa, >Q61136	MAATEPPSLR... at 45%
17	2055aa, >O75962	MKAMDVLPIL... *
> = starts a new cluster
* = this sequence is a representative of this cluster
% = the identity between this sequence and the representative
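For anyone who wants to work with this listing programmatically, here is a minimal Python sketch that reads cluster lines in the layout shown above into a dictionary. It is written against the lines in this post only; other CD-HIT versions may lay the lines out differently, and the names parse_clusters, HEADER, MEMBER and IDENTITY are invented for the example.

```python
import re

HEADER = re.compile(r"^>Cluster\s+(\d+)")
MEMBER = re.compile(r"^(\d+)\s+(\d+)aa,\s*>(\S+)")
IDENTITY = re.compile(r"at\s+([\d.]+)%\s*$")


def parse_clusters(lines):
    """Map cluster number -> list of member records, in file order."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.rstrip()
        header = HEADER.match(line)
        if header:
            current = int(header.group(1))
            clusters[current] = []
            continue
        member = MEMBER.match(line)
        if member is None or current is None:
            continue  # skip blank or unrecognised lines
        is_rep = line.endswith("*")  # "*" marks the cluster representative
        identity = IDENTITY.search(line)
        clusters[current].append({
            "id": member.group(3),
            "length": int(member.group(2)),
            "representative": is_rep,
            # Identity to the representative; None for the representative itself.
            "identity": None if is_rep or identity is None
            else float(identity.group(1)),
        })
    return clusters


# Cluster 4 from the listing above.
sample = [
    ">Cluster 4",
    "0\t3984aa, >Q92736\tMADGGEGEDE... at 42%",
    "1\t4054aa, >P11716\tMGDGGEGEDE... *",
]
clusters = parse_clusters(sample)
representatives = [m["id"] for members in clusters.values()
                   for m in members if m["representative"]]
print(representatives)
```

Collecting the representatives this way gives one sequence per cluster at whatever threshold CD-HIT was run with; it does not by itself lower that threshold.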
How can I get sequences with 25% homology using the results from CD-HIT? Or does anyone know how I can get blastclust working?
Would it be possible if I sent my FASTA file to anyone (a professional or bioinformatician) to try and get blastclust to work? 
Any help or suggestions would be very much appreciated. 
Thank you
Sara  
Hello.
This can be a bit of a pain. I had exactly the same problem some time ago. Sadly I can't remember how I fixed it! One thing you can try is to make all of the sequence names unique by writing a [perl] script that renames the sequences to the numbers 1..n; they should then be unique.
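A rough sketch of that renaming, in Python rather than Perl. It numbers the records 1..n and also keeps a lookup from each new number back to the original header line, so the original accessions can be recovered afterwards; that lookup is an addition of this sketch, and the function name renumber_fasta is made up. Whether blastclust accepts purely numeric IDs has not been tested here.

```python
def renumber_fasta(fasta_lines):
    """Rename FASTA records to 1..n.

    Returns (renamed_lines, mapping), where mapping[new_id] holds the
    original header text (without the leading ">").
    """
    renamed = []
    mapping = {}
    counter = 0
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            counter += 1
            new_id = str(counter)
            mapping[new_id] = line[1:]   # remember the original header
            renamed.append(">" + new_id)
        else:
            renamed.append(line)         # sequence lines pass through unchanged
    return renamed, mapping


# Two made-up records that share one accession.
records = [">Q9Y6D5", "MTTQAPTFTQ", ">Q9Y6D5", "MEHFLLEVAA"]
renamed, mapping = renumber_fasta(records)
print(renamed)
print(mapping)
```

After clustering, each number in the output can be translated back through the mapping to the accession it came from.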
The alternative is to use muscle - I am fairly certain you can use it to cluster sequences.
The error is not major, it makes perfect sense if you think about it.   
Are you working with structures too?  If so I have another idea.
If you can post the file here I will have a look, or email it to <my username>at googlemail.com and put up a post here telling me you have sent it; otherwise I probably won't check.
Hi perlmunky,
I don't really want to rename the sequences, as I'll need the identifiers to link to the Gene Ontology database. Imagine trying to search Gene Ontology without a unique identifier... it'll be a nightmare!
Do you think CD-HIT is a waste of time? I'm kinda getting frustrated with clustering, as I've spent the past 3 weeks trying to figure out blastclust. I'm not working with structures at the moment, but I may be in the next few weeks.
I'm attaching my FASTA file with this post.
I really appreciate your help on this. Cheers!
Sara
P.S. I've emailed you a copy of the FASTA file.
