CD-HIT documentation - (Aug/30/2006)

Hi,

I am new to the Unix environment, so installing a clustering program called CD-HIT and getting it to work is proving to be a problem.

The installation guide does not give step-by-step procedures, so I am having trouble.

I need some help desperately. Is anybody experienced in working with CD-HIT?

Thanks a lot.

-kamesh-

Hi,

There's no need to use "make" on Unix platforms. Just unpack the files as usual after downloading and open a terminal. Once you're in the correct working directory, run CD-HIT as follows:

cd-hit -i input.txt -o output.txt -c 0.4 -n 2

where:

-i = input file containing FASTA-formatted sequences
-o = output file
-c = sequence identity threshold (0.4 in this case, which is the lowest threshold level)
-n = word length (2 is the word length to use at a 40% identity threshold)
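
Since -n has to match -c, here is a minimal Python sketch that picks a word length for a given threshold and assembles the command line. The ranges (0.7-1.0 gives 5, 0.6-0.7 gives 4, 0.5-0.6 gives 3, 0.4-0.5 gives 2) are what I recall from the CD-HIT user guide for protein clustering, so treat them as an assumption and check them against your version. The function names are made up for illustration.

```python
def word_length_for_identity(c):
    """Suggest a -n word length for a protein identity threshold -c.

    Assumed ranges (verify against the CD-HIT user guide for your version):
    0.7-1.0 -> 5, 0.6-0.7 -> 4, 0.5-0.6 -> 3, 0.4-0.5 -> 2.
    """
    if not 0.4 <= c <= 1.0:
        raise ValueError("expected a threshold between 0.4 and 1.0")
    if c >= 0.7:
        return 5
    if c >= 0.6:
        return 4
    if c >= 0.5:
        return 3
    return 2


def build_command(fasta_in, out, c):
    """Return the cd-hit argument list for the flags used in this thread."""
    n = word_length_for_identity(c)
    return ["cd-hit", "-i", fasta_in, "-o", out, "-c", str(c), "-n", str(n)]


# Print the command instead of running it, so this works without cd-hit installed.
print(" ".join(build_command("input.txt", "output.txt", 0.4)))
```

For the example above, this prints the same command line as the one shown earlier.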

Job done! You'll get an output file of representative sequences, plus a .clstr file listing the clusters. Sequences in the same cluster share at least the chosen percent identity with the cluster's representative, which is 40% in this example.

However, I'd recommend using BLASTClust, as it's much more efficient. It's available to download from the NCBI website.

Hope this helps

Good Luck!

-sara.pl-

QUOTE (sara.pl @ Aug 30 2006, 05:15 PM)

Hi,

Thanks a lot. The installation worked and it got the job done in a jiffy.

There's another problem. The protein database that I want to cluster with CD-HIT has 15,852 sequences. However, CD-HIT reads only 15,843 sequences, as shown by the total sequence count it prints at startup, before clustering even begins.

The clustering of the 15,843 sequences (instead of 15,852) does go ahead, however.

Can you think of what might be happening?

Thanks.

-kamesh-

QUOTE (kamesh @ Aug 31 2006, 06:43 AM)

Hi,

Glad it worked. I'm not really sure why that's happening. I only use CD-HIT for around 2000 sequences or fewer. It may be skipping some sequences because it doesn't recognise them as unique. Are you sure there aren't two or more copies of the same sequence? If there are, it may not count them separately.
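
A quick way to test that guess is to look for identical sequences in the input. This is only a sketch: it assumes a plain FASTA file with ">" header lines, treats sequences as case-insensitive, and uses function names invented for this example. It says nothing about what CD-HIT does internally.

```python
from collections import Counter
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def duplicate_sequences(handle):
    """Return {sequence: count} for sequences that occur more than once."""
    counts = Counter(seq.upper() for _, seq in read_fasta(handle))
    return {seq: n for seq, n in counts.items() if n > 1}


# Tiny in-memory example; for a real file use open("input.txt") instead.
demo = io.StringIO(">a\nMKT\nLLV\n>b\nMKTLLV\n>c\nGGG\n")
print(duplicate_sequences(demo))
```

An empty result means there are no exact duplicates, so the missing sequences must have another cause.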

-sara.pl-

QUOTE (sara.pl @ Aug 31 2006, 04:47 AM)

Hi,

I have now found out that the missing sequences are shorter than the "throw-away length", which by default seems to be around 20 amino acids. The user guide/readme/help file says something about adjusting this throw-away length. In general, it looks like the clustering of short sequences is less stable than that of long sequences.
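
To confirm this on your own data, a small Python sketch can count how many sequences fall below a length cutoff. The cutoff is a parameter because the exact default, and whether CD-HIT drops sequences equal to the cutoff or only shorter ones, are assumptions here; the function names are hypothetical. If the explanation holds for my file, the short count should be 9 (15,852 minus 15,843).

```python
import io


def sequence_lengths(handle):
    """Yield the length of each sequence in a FASTA-formatted text stream."""
    length, seen_header = 0, False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                yield length
            length, seen_header = 0, True
        elif seen_header:
            length += len(line)
    if seen_header:
        yield length


def count_short(handle, cutoff):
    """Return (total, short), where short counts sequences with length < cutoff.

    Using a strict "<" is an assumption; adjust it if your CD-HIT version
    also discards sequences exactly as long as the throw-away length.
    """
    total = short = 0
    for length in sequence_lengths(handle):
        total += 1
        if length < cutoff:
            short += 1
    return total, short


# Tiny in-memory example; for a real file use open("input.txt") instead.
demo = io.StringIO(">a\nMKTLLV\n>b\nMK\n>c\nGGGGG\nGGGGG\n")
total, short = count_short(demo, 5)
print(total, "sequences,", short, "shorter than the cutoff")
```

Comparing total minus short with the number CD-HIT reports at startup shows whether the cutoff accounts for the whole difference.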

-kamesh-

QUOTE (kamesh @ Aug 31 2006, 02:05 PM)

Right, so the smaller sequences have to be left out for CD-HIT to work properly. Welcome to the world of bioinformatics.

GOOD LUCK!

-sara.pl-