Gene cross-checking - (Jul/20/2010 )
Hello, I was hoping you could assist me in a matter. I am trying to cross-check a list of around (700-4000) genes in C. elegans with homologous genes in humans and was wondering if there was a way to do it without individually searching for each gene? Your help would be greatly appreciated, thank you!
What do you mean by "cross-check"? Find the closest ortholog by BLAST?
HomeBrew on Jul 20 2010, 09:25 PM said:
I have the names but not the sequences of the genes in C. elegans. Finding the closest ortholog would be great, would BLAST work with just the names?
Thanks!
So, we have several things to accomplish:
1. We must find the protein sequences for the genes in your C. elegans list and save them somewhere.
2. We must turn the human proteome into a BLASTable database.
3. We must the BLAST the C. elegans protein sequences you're interested in against the human database.
4. We must save the results of these BLASTs in some kind of managable format.
The first thing you'll need is Perl. If you have it already, fine. If you don't, you can download it for free from ActiveState. Assuming you're on a Windows machine, download ActivePerl 5.8.9 from here and run the program to install Perl on your computer. The link is for Perl version 5.8.9; there are later versions (5.10.x and 5.12.x), but the script I've written for you below has only been tested with Perl 5.8, so get that for now...
Now, you'll need the C. elegans and H. sapiens protein sequences in FASTA format. You can get both these files from KEGG (Kyoto Encyclopedia of Genes and Genomes). Create a directory on your computer called whatever you like, say "C.elegans.data". Download the file "c.elegans.pep" from here, and the file called "h.sapiens.pep" from here, and save them to the C.elegans.data directory.
You'll also need the BLAST stand-alone executable files. Download the file blast-2.2.23-ia32-win32.exe from here (assuming you're using Windows). Save the file in your C.elegans.data directory. Run the blast-2.2.23-ia32-win32.exe file; it will create three directories (bin, data, and doc) within the C.elegans.data directory.
Put your list of C. elegans gene names in a text file, one name per line. So, for example, I just took ten random gene names from the c.elegans.pep file as follows:
W10C8.2
F41G3.2
Y54E2A.6
F53C11.3
C05D2.4
Y73B3A.13
Y111B2A.14
F09E10.8
C04F5.5
COX1
Save this file as "genelist.txt", also in the C.elegans.data directory.
Cut-and-paste this Perl script:
#!/usr/bin/perl -w
use strict;
$| = 1;
my $source_pep_file = 'c.elegans.pep';
my $output_subset_file = 'c.elegans.genelist.peps.fasta';
my $genelist_file = 'genelist.txt';
my $blast_input_fasta = 'h.sapiens.pep';
my $blast_db_name = 'h.sapiens';
my $blast_results_file = 'c.elegans-vs-h.sapiens.blast.tab';
my (%source_peps, $name, $comment);
print "Reading in source peptide file $source_pep_file.\n\n";
open (CE, $source_pep_file) or die "Can't find $source_pep_file: !$\n";
while (<CE>) {
if (/^>(cel:.+?)\s+(.+)\n/) {
($name, $comment) = ($1, $2);
$source_peps{$name}{comment} = $comment;
} else {
chomp;
$source_peps{$name}{seq} .= $_;
}
}
close (CE);
open (OUT, ">$output_subset_file") or die "Can't open $output_subset_file for writing: $!\n";
open (GL, $genelist_file) or die "Can't open $genelist_file: $!\n";
print "Reading in gene list file $genelist_file; writing peptide sequences to $output_subset_file.\n\n";
while (<GL>) {
next if (/^\s*$/);
chomp;
my $gene_name = "cel:" . $_;
if (exists $source_peps{$gene_name}) {
print OUT ">$gene_name $source_peps{$gene_name}{comment}\n" . fasta($source_peps{$gene_name}{seq});
} else {
print "Can't find a gene named $gene_name in the $source_pep_file file!\n\n";
}
}
close (GL);
close (OUT);
if (!(-e "$blast_db_name.psq")) {
print "Creating BLAST database $blast_db_name, using $blast_input_fasta as input.\n\n";
system("./bin/formatdb -i $blast_input_fasta -p T -n $blast_db_name");
}
print "BLASTing $output_subset_file against the $blast_db_name database; saving results in $blast_results_file\n\n";
system("./bin/blastall -m 8 -e 1e-03 -p blastp -d $blast_db_name -F F -i $output_subset_file -o $blast_results_file");
my $results;
{
local $/;
open (RES, $blast_results_file) or die "Can't open $blast_results_file: $!\n";
$results = <RES>;
close (RES);
}
$results = "#Query id\tSubject id\t% identity\talignment length\tmismatches\tgap openings\tq. start\tq. end\ts. start\ts. end\te-value\tbit score\n" . $results;
open (HDR, ">$blast_results_file") or die "Can't open $blast_results_file: $!\n";
print HDR $results;
close (HDR);
print "Done.\n";
sub fasta {
my $seq = shift;
my $length = length ($seq);
my ($i, $fasta);
my $len = 60;
my $out_pat = "A$len";
my $whole = int($length/$len) * $len;
for ($i = 0; $i < $whole; $i += $len) {
my $line = pack ($out_pat, substr($seq, $i, $len)) . "\n";
$fasta .= $line;
}
if (my $last = substr($seq, $i)) {
$fasta .= $last . "\n";
}
return $fasta;
}
and save this as a text file called "get_homologs.pl".
So here's where we are: Perl is installed on your machine, and the C.elegans.data directory contains the text files "c.elegans.pep", "h.sapiens.pep", "genelist.txt", and "get_homologs.pl", and has contained within it the subdirectories "bin", "data", and "doc" which were created by running (extracting) blast-2.2.23-ia32-win32.exe.
If all that's true, get a DOS window (in XP, go to Start->Run and type "cmd" in the "Open:" dialog box), navigate to the C.elegans.data directory, and execute the Perl script by typing "perl get_homologs.pl" (without the quotes) at the command prompt (>).
The script will create, among other things, a file called "c.elegans.genelist.peps.fasta" that contains the protein sequences of the C. elegans genes listed in your genelist.txt file, and "c.elegans-vs-h.sapiens.blast.tab" which contains the results of BLASTing these protein sequences against the H. sapiens database.
The results file contains tab-separated entries that look like this:
cel:W10C8.2 hsa:83439 54.32 81 35 1 186 266 341 419 2e-022 103
The column headings (also contained within the file) are:
Query id
Subject id
% identity
alignment length
mismatches
gap openings
q. start
q. end
s. start
s. end
e-value
bit score
Thus, this particular example is showing that the C. elegans W10C8.2 protein matches H. sapiens 83439 protein with 54.32% identity over an 81-amino acid stretch (query 186-266, subject 341-419) and the e-value of the hit is 2e-022.
If you'd like to generate the results file such that it contains the BLAST data in the more traditional format, change the "-m 8" switch in the blastall system call line in the Perl script to "-m 0"...
Hope this helps!
HomeBrew on Jul 21 2010, 02:27 PM said:
1. We must find the protein sequences for the genes in your C. elegans list and save them somewhere.
2. We must turn the human proteome into a BLASTable database.
3. We must the BLAST the C. elegans protein sequences you're interested in against the human database.
4. We must save the results of these BLASTs in some kind of managable format.
The first thing you'll need is Perl. If you have it already, fine. If you don't, you can download it for free from ActiveState. Assuming you're on a Windows machine, download ActivePerl 5.8.9 from here and run the program to install Perl on your computer. The link is for Perl version 5.8.9; there are later versions (5.10.x and 5.12.x), but the script I've written for you below has only been tested with Perl 5.8, so get that for now...
Now, you'll need the C. elegans and H. sapiens protein sequences in FASTA format. You can get both these files from KEGG (Kyoto Encyclopedia of Genes and Genomes). Create a directory on your computer called whatever you like, say "C.elegans.data". Download the file "c.elegans.pep" from here, and the file called "h.sapiens.pep" from here, and save them to the C.elegans.data directory.
You'll also need the BLAST stand-alone executable files. Download the file blast-2.2.23-ia32-win32.exe from here (assuming you're using Windows). Save the file in your C.elegans.data directory. Run the blast-2.2.23-ia32-win32.exe file; it will create three directories (bin, data, and doc) within the C.elegans.data directory.
Put your list of C. elegans gene names in a text file, one name per line. So, for example, I just took ten random gene names from the c.elegans.pep file as follows:
W10C8.2
F41G3.2
Y54E2A.6
F53C11.3
C05D2.4
Y73B3A.13
Y111B2A.14
F09E10.8
C04F5.5
COX1
Save this file as "genelist.txt", also in the C.elegans.data directory.
Cut-and-paste this Perl script:
#!/usr/bin/perl -w
use strict;
$| = 1;
my $source_pep_file = 'c.elegans.pep';
my $output_subset_file = 'c.elegans.genelist.peps.fasta';
my $genelist_file = 'genelist.txt';
my $blast_input_fasta = 'h.sapiens.pep';
my $blast_db_name = 'h.sapiens';
my $blast_results_file = 'c.elegans-vs-h.sapiens.blast.tab';
my (%source_peps, $name, $comment);
print "Reading in source peptide file $source_pep_file.\n\n";
open (CE, $source_pep_file) or die "Can't find $source_pep_file: !$\n";
while (<CE>) {
if (/^>(cel:.+?)\s+(.+)\n/) {
($name, $comment) = ($1, $2);
$source_peps{$name}{comment} = $comment;
} else {
chomp;
$source_peps{$name}{seq} .= $_;
}
}
close (CE);
open (OUT, ">$output_subset_file") or die "Can't open $output_subset_file for writing: $!\n";
open (GL, $genelist_file) or die "Can't open $genelist_file: $!\n";
print "Reading in gene list file $genelist_file; writing peptide sequences to $output_subset_file.\n\n";
while (<GL>) {
next if (/^\s*$/);
chomp;
my $gene_name = "cel:" . $_;
if (exists $source_peps{$gene_name}) {
print OUT ">$gene_name $source_peps{$gene_name}{comment}\n" . fasta($source_peps{$gene_name}{seq});
} else {
print "Can't find a gene named $gene_name in the $source_pep_file file!\n\n";
}
}
close (GL);
close (OUT);
if (!(-e "$blast_db_name.psq")) {
print "Creating BLAST database $blast_db_name, using $blast_input_fasta as input.\n\n";
system("./bin/formatdb -i $blast_input_fasta -p T -n $blast_db_name");
}
print "BLASTing $output_subset_file against the $blast_db_name database; saving results in $blast_results_file\n\n";
system("./bin/blastall -m 8 -e 1e-03 -p blastp -d $blast_db_name -F F -i $output_subset_file -o $blast_results_file");
my $results;
{
local $/;
open (RES, $blast_results_file) or die "Can't open $blast_results_file: $!\n";
$results = <RES>;
close (RES);
}
$results = "#Query id\tSubject id\t% identity\talignment length\tmismatches\tgap openings\tq. start\tq. end\ts. start\ts. end\te-value\tbit score\n" . $results;
open (HDR, ">$blast_results_file") or die "Can't open $blast_results_file: $!\n";
print HDR $results;
close (HDR);
print "Done.\n";
sub fasta {
my $seq = shift;
my $length = length ($seq);
my ($i, $fasta);
my $len = 60;
my $out_pat = "A$len";
my $whole = int($length/$len) * $len;
for ($i = 0; $i < $whole; $i += $len) {
my $line = pack ($out_pat, substr($seq, $i, $len)) . "\n";
$fasta .= $line;
}
if (my $last = substr($seq, $i)) {
$fasta .= $last . "\n";
}
return $fasta;
}
and save this as a text file called "get_homologs.pl".
So here's where we are: Perl is installed on your machine, and the C.elegans.data directory contains the text files "c.elegans.pep", "h.sapiens.pep", "genelist.txt", and "get_homologs.pl", and has contained within it the subdirectories "bin", "data", and "doc" which were created by running (extracting) blast-2.2.23-ia32-win32.exe.
If all that's true, get a DOS window (in XP, go to Start->Run and type "cmd" in the "Open:" dialog box), navigate to the C.elegans.data directory, and execute the Perl script by typing "perl get_homologs.pl" (without the quotes) at the command prompt (>).
The script will create, among other things, a file called "c.elegans.genelist.peps.fasta" that contains the protein sequences of the C. elegans genes listed in your genelist.txt file, and "c.elegans-vs-h.sapiens.blast.tab" which contains the results of BLASTing these protein sequences against the H. sapiens database.
The results file contains tab-separated entries that look like this:
cel:W10C8.2 hsa:83439 54.32 81 35 1 186 266 341 419 2e-022 103
The column headings (also contained within the file) are:
Query id
Subject id
% identity
alignment length
mismatches
gap openings
q. start
q. end
s. start
s. end
e-value
bit score
Thus, this particular example is showing that the C. elegans W10C8.2 protein matches H. sapiens 83439 protein with 54.32% identity over an 81-amino acid stretch (query 186-266, subject 341-419) and the e-value of the hit is 2e-022.
If you'd like to generate the results file such that it contains the BLAST data in the more traditional format, change the "-m 8" switch in the blastall system call line in the Perl script to "-m 0"...
Hope this helps!
Dear Homebrew,
Thank you very much for taking the time to help me out with this! I tried exactly what you had said but there were a few problems along the way. The first was the link with the BLAST.exe file timed out, I did a search and downloaded the ncbi-blast-2.2.23+-win32.exe file from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ not sure if the update may have changed the file in someway? I have all the neccessary files as you outline: Perl is installed, and the C.elegans.data directory contains the text files "c.elegans.pep", "h.sapiens.pep", "genelist.txt", and "get_homologs.pl", and has contained within it the subdirectories "bin", "data", and "doc" which were created by running (extracting) blast-2.2.23-ia32-win32.exe.
When I use cmd perl get_homologs.pl the cmd prompt returns an error message saying: Can't open perl script "get_homologs.pl": No such file or directory
I tried manually opening the get_homologs.pl file in my C.elegan directory and it created a FAFSA file but the file is 0 kb, can't be opened, and in the process of making it there were multiple scrolling error messages.
I'm thinking I may not be nagivating properly to the C.elegans directory, in the command prompt window...I just typed "perl get_homologs.pl" in the cmd window. Any ideas of what may be going wrong?
I attached a picture of my directory, perhaps I'm missing something?
nemat on Jul 22 2010, 03:07 PM said:
I agree -- this is the problem. You need to be in the C.elegans directory (or whatever you called the directory containing the source files and the Perl script) before you issue the "perl get_homologs.pl" command, or else you'll get the "No such file or directory" error.
For this example, I'm going to assume you have a directory called "C.elegans" on your Windows XP Desktop. Here's how to navigate to that folder from the DOS command prompt.
When I go to Start->Run, and type cmd in the dialog box, a DOS window opens that looks like this:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\Mike>
This means I'm starting out in the C:\Documents and Settings\<username> directory, in this case the user name is "Mike". To navigate to the Desktop directory, and then to the C.elegans directory, you use the "cd" command (change directory). The command from the user directory is "cd Desktop\C.elegans", thus:
C:\Documents and Settings\Mike>cd Desktop\C.elegans
Now, assuming we didn't get an error, let's check the contents of the directory by using the "dir" command with a few switches:
C:\Documents and Settings\Mike\Desktop\C.elegans>dir /B /O
bin
data
doc
c.elegans.pep
genelist.txt
get_homologs.pl
h.sapiens.pep
Looks good -- al the required files are there. Now let's run the Perl script:
C:\Documents and Settings\Mike\Desktop\C.elegans>perl get_homologs.pl
Reading in source peptide file c.elegans.pep.
Reading in gene list file genelist.txt; writing peptide sequences to c.elegans.genelist.peps.fasta.
Creating BLAST database h.sapiens, using h.sapiens.pep as input.
BLASTing c.elegans.genelist.peps.fasta against the h.sapiens database; saving results in c.elegans-vs-h.sapiens.blast.tab
Done.
C:\Documents and Settings\Mike\Desktop\C.elegans>
Let me know if you run into any difficulties...
HB
I fixed the link in my post above. I notice you don't have a data directory in your C.elegans.data directory -- not sure if the blast+ series of stand-alone BLAST programs work the same as the non-+ versions. You should be able to DL blast-2.2.23-ia32-win32.exe now from the fixed link above.
BTW -- go Minutemen!
HB '88
Dear Homebrew,
Okay, I navigated properly to my database where all the files are. I ran the script and it seemed to go through the files but i recieved this error message which I attached. Saying it can't find (gene name) in c.elegans.pep file.
I'm wondering if I downloaded the pep files in a wrong way so they lost their formatting?
When I downloaded the pep files, I right clicked, saved target as and then downloaded it to my directory as a pep file, which I later turned into a text file. I attached an image of the formatting of the file, perhaps this is the problem?
I also attached an image of the directory after the BLAST attempt, does everything look alright now?
Thanks again for all your help! And yes GO MINUTEMEN, class of 2010 here, small world!
I realize the directory after blast didnt show up. I re-attached it to this one!