
CD-HIT documentation - (Aug/30/2006 )

Hi,

I am new to the Unix environment, so installing a clustering program called CD-HIT and getting it to work is becoming a problem.

The installation guide does not give step-by-step procedures, so I am having trouble.

I desperately need some help. Is anybody experienced in working with CD-HIT?

Thanks a lot.

-kamesh-

Hi,

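If your download already contains a compiled cd-hit binary, there's no need to run "make" on Unix platforms (a source-only download has to be compiled with "make" first). Just unpack the files as usual and open a terminal. Once you're in the correct working directory, run CD-HIT as follows: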

cd-hit -i input.txt -o output.txt -c 0.4 -n 2

where:

-i = input file containing FASTA formatted sequences
-o = output file
-c = sequence identity threshold (0.4 in this case, which is the lowest threshold cd-hit accepts)
-n = word length, chosen to match the threshold (2 for thresholds of 0.4 to 0.5)
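
As a rough illustration of how -c and -n go together, here is a minimal Python sketch. The word-length ranges are the ones given in the CD-HIT user guide for protein clustering; the helper names `suggest_word_length` and `build_cd_hit_command` are made up for this example, and the choice of the larger word length at a range boundary (for example 0.5) is my assumption. It only builds the argument list and does not run cd-hit.

```python
def suggest_word_length(identity):
    """Return a word length (-n) for a protein identity threshold (-c).

    Ranges from the CD-HIT user guide:
    0.7-1.0 -> 5, 0.6-0.7 -> 4, 0.5-0.6 -> 3, 0.4-0.5 -> 2.
    At a boundary value the larger word length is used (an assumption).
    """
    if not 0.4 <= identity <= 1.0:
        raise ValueError("cd-hit expects a threshold between 0.4 and 1.0")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def build_cd_hit_command(fasta_in, fasta_out, identity):
    """Build the argument list for a cd-hit run (nothing is executed)."""
    return [
        "cd-hit",
        "-i", fasta_in,
        "-o", fasta_out,
        "-c", str(identity),
        "-n", str(suggest_word_length(identity)),
    ]


if __name__ == "__main__":
    # Reproduces the command from the post for a 40% threshold.
    print(" ".join(build_cd_hit_command("input.txt", "output.txt", 0.4)))
```

The list form can be passed to `subprocess.run` if you want to launch cd-hit from a script rather than typing the command by hand.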

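Job done! You'll get an output file of representative sequences, plus a cluster listing in a second file with .clstr appended to the output name. Every sequence in a cluster meets the chosen identity threshold against that cluster's representative, 40% in this example.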
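
If you want a quick summary of the result, the cluster listing can be read with a few lines of Python. This is only a sketch: it assumes the listing is the .clstr file cd-hit writes next to the output (output.txt.clstr for the command above), that each cluster starts with a line beginning ">Cluster", and that every following line is one member. The function name `cluster_sizes` and the example lines are made up for illustration.

```python
def cluster_sizes(clstr_lines):
    """Return the number of member sequences in each cluster, in file order."""
    sizes = []
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)      # a new cluster starts here
        elif sizes:
            sizes[-1] += 1       # any other line is one member sequence
    return sizes


if __name__ == "__main__":
    # Made-up lines in the .clstr layout, used here instead of a real file.
    example = [
        ">Cluster 0",
        "0\t120aa, >seqA... *",
        "1\t118aa, >seqB... at 45.00%",
        ">Cluster 1",
        "0\t95aa, >seqC... *",
    ]
    sizes = cluster_sizes(example)
    print(len(sizes), "clusters with sizes", sizes)
```

For a real run, pass an open file instead of the list, for example `cluster_sizes(open("output.txt.clstr"))`.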

However, I'd recommend using BLASTClust, as I find it much more efficient. It's available to download from the NCBI website.

Hope this helps

Good Luck!

-sara.pl-

In reply to sara.pl (Aug 30 2006, 05:15 PM):

Hi,

Thanks a lot. The installation worked and it got the job done in a jiffy.

There's another problem that I have. The protein database that I want to run through cd-hit has 15,852 sequences. However, cd-hit reads only 15,843 sequences, as shown by the total sequence count it displays right after the command is run (before the clustering even begins).

The clustering of the 15,843 sequences (instead of 15,852) does go ahead, however.

Can you think of what might be happening?

thanks.


-kamesh-

In reply to kamesh (Aug 31 2006, 06:43 AM):

Hi,

Glad it worked. I'm not really sure why that's happening. I only use CD-HIT for around 2000 sequences or fewer. It may be skipping some sequences because it does not recognise them as unique. Are you sure there aren't two or more copies of the same sequence? If there are, it may not count the duplicates separately.

-sara.pl-

In reply to sara.pl (Aug 31 2006, 04:47 AM):

Hi,

I have now found out that the missing sequences are below the "throw-away size", which by default seems to be around 20 amino acids. The user guide/readme/help file says something about adjusting this throw-away length. In general, it looks like the clustering of short sequences is less stable than that of long sequences :)
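
For anyone hitting the same mismatch, a quick check like the sketch below shows which records fall under the cutoff. It is only an illustration: the function names are made up, the cutoff of 20 is just the figure mentioned above, and treating "length less than or equal to the cutoff" as discarded is my assumption. In the cd-hit versions I've seen, the throw-away length is set with the -l option, but check the help output of your own copy.

```python
def read_fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA text lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, length     # finish the previous record
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)          # sequences may span several lines
    if header is not None:
        yield header, length             # last record in the file


def short_sequences(lines, cutoff):
    """Return (header, length) for records at or below the cutoff (assumed rule)."""
    return [(h, n) for h, n in read_fasta_lengths(lines) if n <= cutoff]


if __name__ == "__main__":
    # Made-up records; for a real check pass open("input.txt") instead.
    example = [
        ">long_protein",
        "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "APILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ",
        ">short_peptide",
        "MKTAYIAKQR",
    ]
    total = sum(1 for _ in read_fasta_lengths(example))
    short = short_sequences(example, cutoff=20)
    print(total, "sequences in total,", len(short), "at or below the cutoff")
    for header, length in short:
        print(header, length)
```

If the number of short records matches the difference between your own count and the total cd-hit reports, the throw-away length is the likely explanation.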

-kamesh-

In reply to kamesh (Aug 31 2006, 02:05 PM):


Right, so you'll have to remove the shorter sequences, or adjust the throw-away length, in order for CD-HIT to work properly. Welcome to the world of bioinformatics!

GOOD LUCK!

-sara.pl-