Protocol Online logo
Top : Forum Archives: : Bioinformatics and Biostatistics

Accessing local database simultaneously with a list of acc numbers - (Aug/30/2006 )

Pages: 1 2 3 Next

hi ,

I am new to the programming and computational world...and now i have this daunting task at hand...

I have a text file (input) in which i have a list of protein accession numbers like

>ci0100130000
>ci0100130001
>ci0100130005
>ci0100130010
........................

...and i would want to retrieve the sequences that correspond to this accession number from a local protein database (which is in fasta format) .
which looks something like this...

>ci0100130000
MPLEENISSSKRKPGSRGGVSFFSYFTQELTHGYFMDQNDARYTERRERVYTFLKQPREIEKVRPFPPFL
>ci0100130001
MLPIVDFKQCRPSVEASDKEINETAKLLVDALSTVGFAYLKNCGIKKNCRRSQKHRG*MGGVRYLYYPPI
>ci0100130002
MNWKTWEEMENDLGIYYRPTNRKLDRRKGPIEEGQINFKITIPSTLKRKIKHDVDKNLNEELIENADKQQ
>ci0100130003
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER


..Any ideas as to how i can go about it...

thanks a zillion...beforehand

-kamesh-

This should be relatively straight forward... A few questions first:

  1. What operating system are you using?
  2. Do you have Perl available?
  3. Is the protein database stored on the same machine as the list of accession numbers, or is it networked?
  4. How do you normally access this protein database?
  5. How big (apx. megabytes) is the protein database?

-HomeBrew-

QUOTE (HomeBrew @ Aug 31 2006, 03:39 AM)
This should be relatively straight forward... A few questions first:
  1. What operating system are you using?

    Using Red hat linux
  2. Do you have Perl available?

    using perl v 5.8.5
  3. Is the protein database stored on the same machine as the list of accession numbers, or is it networked?

    yup, the protein database is stored in the same machine as the list of accession numbers.
  4. How do you normally access this protein database?

    I normally copy/ paste the sequence, once after I grep the accession number , in the database file. But that will take eternity......
  5. How big (apx. megabytes) is the protein database?
the (fasta)database is nearly 100 MB.

any help with perl scripting in this regard will be greatly appreciated......

-kamesh-

Okay, I'll give it a shot. I'm pretty busy, so it might take me a day or so to post something up here...

Seems like my approach will be to read the list of accession numbers into a hash, then read the database file (setting the record seperator to ">", thus one protein sequence at a time), then compare the first line of the record (sequence) to the keys of the hash (accession numbers), and if there's a match, print that record out to a file...

If your database were smaller, I'd just slurp the whole thing into RAM, but there's no need to do that -- we can pull one sequence in at a time easily enough...

-HomeBrew-

QUOTE (kamesh @ Aug 31 2006, 12:57 PM)
any help with perl scripting in this regard will be greatly appreciated......


Well I have to say HomeBrew is the King of perl scripting! biggrin.gif (And that's a fact!)

-sara.pl-

Give this a try:

CODE
#!/usr/bin/perl -w
use strict;

open (ACC, "accession_numbers.txt") or die "Can't open accession_numbers.txt: $!\n";
open (DB, "protein_db.txt") or die "Can't open protein_db.txt: $!\n";
open (HIT, ">hit_list.txt") or die "Can't open hit_list.txt: $!\n";
open (MISS, ">miss_list.txt") or die "Can't open miss_list.txt: $!\n";

my @acc;

while (<ACC>) {
    /^>(([a-z]{2}\d{10}))\n/;
    push (@acc, $1);
}

$/ = ">";

while (<DB>) {
    my ($seq_name, $seq) = ($_ =~ /([a-z]{2}\d{10})\n(.+)/);
    if ($seq_name && $seq) {
        for (my $i = 0; $i <= $#acc; $i++) {
            if ($acc[$i] eq $seq_name) {
                print HIT ">$acc[$i]\n$seq\n";
                splice(@acc, $i, 1);
            }
        }
    }
}
print MISS join ("\n", @acc);


This script has certain expectations based on your examples above:
  1. It expects accession_numbers.txt to contain a list of accession numbers beginning with a > character, one per line.
  2. It expects each record in protein_db.txt to have an accession number preceeded by a > and terminated by a newline, followed by the protein sequence in a single, unbroken line.
  3. It expects the next record in the protein database will follow the former without any blank lines.
  4. It expects that the accession numbers will conform to the pattern of two lower-case letters followed by 10 digits.
I wasn't sure if you wanted to keep a list of accession numbers that weren't found in the protein database, so I included it anyhow under the theory that you'd rather have it and not need it than need it and not have it...

Let me know if we need to tune this up in any way...

-HomeBrew-

QUOTE (HomeBrew @ Sep 1 2006, 11:08 AM)
Give this a try:

CODE
#!/usr/bin/perl -w
use strict;

open (ACC, "accession_numbers.txt") or die "Can't open accession_numbers.txt: $!\n";
open (DB, "protein_db.txt") or die "Can't open protein_db.txt: $!\n";
open (HIT, ">hit_list.txt") or die "Can't open hit_list.txt: $!\n";
open (MISS, ">miss_list.txt") or die "Can't open miss_list.txt: $!\n";

my @acc;

while (<ACC>) {
    /^>(([a-z]{2}\d{10}))\n/;
    push (@acc, $1);
}

$/ = ">";

while (<DB>) {
    my ($seq_name, $seq) = ($_ =~ /([a-z]{2}\d{10})\n(.+)/);
    if ($seq_name && $seq) {
        for (my $i = 0; $i <= $#acc; $i++) {
            if ($acc[$i] eq $seq_name) {
                print HIT ">$acc[$i]\n$seq\n";
                splice(@acc, $i, 1);
            }
        }
    }
}
print MISS join ("\n", @acc);


This script has certain expectations based on your examples above:
  1. It expects accession_numbers.txt to contain a list of accession numbers beginning with a > character, one per line.
  2. It expects each record in protein_db.txt to have an accession number preceeded by a > and terminated by a newline, followed by the protein sequence in a single, unbroken line.
  3. It expects the next record in the protein database will follow the former without any blank lines.
  4. It expects that the accession numbers will conform to the pattern of two lower-case letters followed by 10 digits.
I wasn't sure if you wanted to keep a list of accession numbers that weren't found in the protein database, so I included it anyhow under the theory that you'd rather have it and not need it than need it and not have it...

Let me know if we need to tune this up in any way...


Hi home brew,

Thanks a lot for helping me on that........

I ran the program, but however this is what i got......

Use of uninitialized value in string eq at localdbaccess.pl line 22, <DB> chunk 1.
.........
...........
Use of uninitialized value in string eq at localdbaccess.pl line 22, <DB> chunk 21.
Which kept running on..........

and the output did not write in into the hit and miss list files..... ( I am ok with the other assumptions that u have made in building up the script )

-kamesh-

Are you sure you copied the code correctly? I get no errors at all when run on either a Windows or a Linux machine...

It's either a faulty assumption about the input files, or it might be a line-ending problem. Here's a slightly changed version, in case it's a line ending problem:

CODE
#!/usr/bin/perl -w
use strict;

open (ACC, "accession_numbers.txt") or die "Can't open accession_numbers.txt: $!\n";
open (DB, "protein_db.txt") or die "Can't open protein_db.txt: $!\n";
open (HIT, ">hit_list.txt") or die "Can't open hit_list.txt: $!\n";
open (MISS, ">miss_list.txt") or die "Can't open miss_list.txt: $!\n";

my @acc;

while (<ACC>) {
    /^>([a-z]{2}\d{10})/;
    push (@acc, $1);
}

$/ = ">";

while (<DB>) {
    my ($seq_name, $seq) = ($_ =~ /([a-z]{2}\d{10})[\n\r](.+)/);
    if ($seq_name && $seq) {
        for (my $i = 0; $i <= $#acc; $i++) {
            if ($acc[$i] eq $seq_name) {
                print HIT ">$acc[$i]\n$seq\n";
                splice(@acc, $i, 1);
            }
        }
    }
}
print MISS join ("\n", @acc);


I'm using the following as input:

accession_numbers.txt contains:

>ci0100130000
>ci0100130001
>ci0100130005
>ci0100130010
>ci0100130003
>ci0100130012

protein_db.txt contains:

>ci0100130000
MPLEENISSSKRKPGSRGGVSFFSYFTQELTHGYFMDQNDARYTERRERVYTFLKQPREIEKVRPFPPFL
>ci0100130001
MLPIVDFKQCRPSVEASDKEINETAKLLVDALSTVGFAYLKNCGIKKNCRRSQKHRG*MGGVRYLYYPPI
>ci0100130002
MNWKTWEEMENDLGIYYRPTNRKLDRRKGPIEEGQINFKITIPSTLKRKIKHDVDKNLNEELIENADKQQ
>ci0100130003
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER
>ci0100130012
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER

Neither of the input files has any blank lines at the top, bottom, or in the body of the file. When I run this script against the input files above, my output files are created and are as follows:

hit_list.txt contains:

>ci0100130000
MPLEENISSSKRKPGSRGGVSFFSYFTQELTHGYFMDQNDARYTERRERVYTFLKQPREIEKVRPFPPFL
>ci0100130001
MLPIVDFKQCRPSVEASDKEINETAKLLVDALSTVGFAYLKNCGIKKNCRRSQKHRG*MGGVRYLYYPPI
>ci0100130003
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER
>ci0100130012
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER

miss_list.txt contains:

ci0100130005
ci0100130010

Please try the script above against the same input as I am (provided above -- save them as text files) and tell me what occurs. As I said, I get no errors on Windows ar Linux, so it's hard for me to troubleshoot...

-HomeBrew-

QUOTE (HomeBrew @ Sep 1 2006, 02:20 PM)
A

Neither of the input files has any blank lines at the top, bottom, or in the body of the file. When I run this script against the input files above, my output files are created and are as follows:

Please try the script above against the same input as I am (provided above -- save them as text files) and tell me what occurs. As I said, I get no errors on Windows ar Linux, so it's hard for me to troubleshoot...

hi,

It worked(linux) when i used the same input as u did ( the input acession_number.txt and the protein_db.txt that you posted )and the hit and miss list showed up what it had to....

It however didnt work for my protein database that i wanted to access, as it showed up the same error....

Use of uninitialized value in string eq at localdbaccess.pl line 22, <DB> chunk 1.
......on and on...

bty, i had said to you that my protein database file is a big one (100 MB), and at the outset it appeared that there were no blank lines in it (none at the top and bottom). However ,as i scrolled through, i found that it did contain blank lines at many places and an occassional ' * ' symbol within the protein sequence or at the end of the protein sequence. I suppose the blank lines are what proabably is causing this. it will be impossible to manually remove the blank lines.

can vi do the trick??....

-kamesh-

QUOTE (kamesh @ Sep 1 2006, 06:32 PM)
bty, i had said to you that my protein database file is a big one (100 MB), and at the outset it appeared that there were no blank lines in it (none at the top and bottom). However ,as i scrolled through, i found that it did contain blank lines at many places and an occassional ' * ' symbol within the protein sequence or at the end of the protein sequence. I suppose the blank lines are what proabably is causing this. it will be impossible to manually remove the blank lines.


Hmm... Even If I intentionally introduce blank lines or line breaks into the DB, I don't get the errors you're reporting. For example, running the script against:

>ci0100130000
MPLEENISSSKRKPGSRGGVSFFSYFTQELTHGYFMDQNDARYT
ERRERVYTFLKQPREIEKVRPFPPFL
>ci0100130001
MLPIVDFKQCRPSVEASDKEINETAKLLVDALSTVGFAYLKNCGIKKNCRRSQKHRG*MGGVRYLYYPPI


>ci0100130002
MNWKTWEEMENDLGIYYRPTNRKLDRRKGPIEEGQINFKITIPSTLK
RKIKHDVDKNLNEELIENADKQQ
>ci0100130003
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER

>ci0100130012
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER

...as protein_db.txt still "works" -- the sequences with internal line endings, like ci0100130002, are reduced to their first line in hit_list.txt, but I don't get the Perl errors you're getting...

There is something else about the contents of either of your input files (most likely the protein DB) that is not as we assume. It can't be internal * characters, because our sample DB has that -- see ci0100130001 -- and the script still works on it.

A simple Perl way to remove the blank lines would be:

CODE
open (IN, "protein_db.txt") or die;
open (OUT, ">protein_db2.txt") or die;

while (<IN>) {
    print OUT unless /^\s*$/;
}


Can you find anything else about the database that might be tripping us up? Do all the accession numbers match the pattern we're expecting? Are there any ">" characters in the DB that do not signal the start of a record? Are there any ">" characters in the DB that are not immediately followed (without spaces) by the accession number?

-HomeBrew-

Pages: 1 2 3 Next