Accessing local database simultaneously with a list of acc numbers - (Aug/30/2006 )

QUOTE (HomeBrew @ Sep 1 2006, 04:15 PM)
Can you find anything else about the database that might be tripping us up? Do all the accession numbers match the pattern we're expecting? Are there any ">" characters in the DB that do not signal the start of a record? Are there any ">" characters in the DB that are not immediately followed (without spaces) by the accession number?

Hi, the program worked, yahoo! Great, HomeBrew, I owe you a lot.

The accession_numbers.txt that I used contained a few different types of accession numbers (a few of them had fewer than 10 digits); after removing those, the error did not show up. However, I found something bizarre in hit_list.txt: the hit list showed only the first line of each protein sequence, not the sequence through to its end (which is what you were referring to with internal line endings).

For example, accession_numbers.txt contained:
>ci0100130000
>ci0100130001
>ci0100130005
>ci0100130010
>ci0100130003
>ci0100130012
>ci0100130019
>ci0100130066

The protein_db.txt contained

>ci0100130000
MPLEENISSSKRKPGSRGGVSFFSYFTQELTHGYFMDQNDARYTERRERVYTFLKQPREIEKVRPFPPFL
CLDVFLYVFTFLPLRVLFALLKLLSAPFCWFQRRSLLDPAQSCDLLKGVIFTSCVFCMSYIDTSIIYHLV
KEVTTPARLRSMRAPSVDHTVAAGTNLPSRNDDDVGDVDVLRHQAPDSVRSRKRHTATIVKATAIDEEI
H*
>ci0100130001
MLPIVDFKQCRPSVEASDKEINETAKLLVDALSTVGFAYLKNCGIKKNCRRSQKHRG*MGGVRYLYYPPI
KGELELNQERLGEHSDYGSITLLFVDDNGGLQIETEGTYKDVPVIEDTILINIGDALEFWTKGKLRSTKH
RVNIPDDEVKRNSIRRSIGYFVFPDDDVVINQPLQFKGDADVPDPVKDPITALKYIQQKLSHTCQNT*
>ci0100130003
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER
WEFLCKYHKPASKVPAFLSVVDIAGLVKGANEGQGLGNAFLSHISGCDAIFHMTRAFDDAEVVHVEGDVN
KELGSESAVKSAGKYRQQGRNYIVEDGDIIFFKFNTPSQPKKK*
and so on....

The hit_list.txt, however, showed only the first line of each sequence:
>ci0100130000
MPLEENISSSKRKPGSRGGVSFFSYFTQELTHGYFMDQNDARYTERRERVYTFLKQPREIEKVRPFPPFL
>ci0100130001
MLPIVDFKQCRPSVEASDKEINETAKLLVDALSTVGFAYLKNCGIKKNCRRSQKHRG*MGGVRYLYYPPI
>ci0100130003
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER
>ci0100130005
CQICFETYTRPKSLNCQHTFCLKCLEEYTPPNSVRVICPTCRSEQPLTADGINGLKDNFFISSMSDMLKT
>ci0100130010
ASYYQNKMVLQREPHRANIWGYGMLGANMTLSISGTNYTTTVRVGPNKAFVWNFILPPYKAGGPFSIKVY
>ci0100130012
MEERNNAVLIASHPNIQPGSMSVDATHKLAMLASRRYLTLIDLTEPNRVIERVNLRNKWDVSHVLWNPTT
>ci0100130019
MGRVRTKTVKKAARVIIEKYYMKLTLDFHTNKRVCEEIAIIPSKKLRNQIAGFVTHLMKRIRVGPVRGIS
>ci0100130066
MKIPFLSWILHSQKFQFLCGIVQMYLASALKHMHISDPHLDRFDAKTSMKKDAFESSNISSLRHSAAS

Can the script be set up to show the full sequence?

-kamesh-

Yes, I can fix it easily. But first, give me examples of the other accession numbers that your list contains so we can get the full list from the database.

In order for the script to work without limitations, I must know everything it might encounter when reading the accessions list and the protein database. Review the four assumptions I outlined above, and let me know of any changes that need to be made (so far, we've broken 3 out of 4 assumptions).

-HomeBrew-

QUOTE (HomeBrew @ Sep 1 2006, 06:31 PM)
Yes, I can fix it easily. But first, give me examples of the other accession numbers that your list contains so we can get the full list from the database.

In order for the script to work without limitations, I must know everything it might encounter when reading the accessions list and the protein database. Review the four assumptions I outlined above, and let me know of any changes that need to be made (so far, we've broken 3 out of 4 assumptions).


hi homebrew,

1. The accession_numbers.txt contains a list of accession numbers, each beginning with a '>' character, with one accession number per line.

2. I figured out that I can do without the accession numbers that differ from the pattern we assumed; in fact, I don't need them. So >ci0100130000 and the like hold good: the accession numbers will conform to the pattern of a '>' followed by two lower-case letters, in turn followed by 10 digits, without any breaks.

3. The protein_db.txt has an accession number preceded by a '>' and terminated by a newline, followed by the protein sequence, which may or may not contain internal line breaks.

4. The next record in the protein database may follow the previous one with or without blank lines in between.

-kamesh-

Okay -- try this:

CODE
#!/usr/bin/perl -w
use strict;

open (ACC, "accession_numbers.txt") or die "Can't open accession_numbers.txt: $!\n";
open (DB, "protein_db.txt") or die "Can't open protein_db.txt: $!\n";
open (HIT, ">hit_list.txt") or die "Can't open hit_list.txt: $!\n";
open (MISS, ">miss_list.txt") or die "Can't open miss_list.txt: $!\n";

my @acc;

while (<ACC>) {
    if (/^>([a-z]{2}\d{10})/) {
        push (@acc, $1);
    } else {
        chomp;
        print "Error in accession_numbers.txt:\n$_\nfails to meet expected pattern.\n";
        print MISS "$_ (fails pattern match)\n";
    }
}

$/ = ">";

while (<DB>) {
    my ($seq_name, $seq) = ($_ =~ /([a-z]{2}\d{10})\n(.+)/s);
    if ($seq_name && $seq) {
        $seq =~ s/>//g;
        $seq =~ s/(?<!\w)\n//g;
        for (my $i = 0; $i <= $#acc; $i++) {
            if ($acc[$i] eq $seq_name) {
                print HIT ">$acc[$i]\n$seq";
                splice(@acc, $i, 1);
            }
        }
    } elsif ($seq_name) {
        print "Error reading sequence portion of database record.\n";
    } elsif ($seq) {
        print "Error reading accession line of database record.\n";
    }
}
print MISS join ("\n", @acc);


I've added a bit of error checking, and some blank line and other cleanup routines...
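
For anyone reading along, here is a minimal sketch of the record-reading idea the script relies on: setting $/ (the input record separator) to ">" makes each read return one whole database record instead of one line. It is an illustration only, not the script itself: the two-record database is made up and held in a string so the sketch runs without any files, and it uses chomp to drop the trailing ">" where the script uses a substitution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up two-record database, kept in a string so no files are needed.
my $db_text = ">ci0100130000\nMPLEEN\nH*\n>ci0100130001\nMLPIVD\nQNT*\n";

open(my $db, '<', \$db_text) or die "Can't open in-memory database: $!\n";

my @records;
{
    local $/ = ">";    # each read now stops at the next ">" instead of at a newline
    while (my $chunk = <$db>) {
        chomp $chunk;  # with $/ set to ">", chomp removes a trailing ">"
        # The first chunk is empty because the data starts with ">";
        # it fails the match below and is skipped.
        my ($name, $seq) = $chunk =~ /^([a-z]{2}\d{10})\n(.+)/s or next;
        $seq =~ s/\s+//g;    # join the wrapped sequence lines into one string
        push @records, [$name, $seq];
    }
}
close $db;

print "$_->[0] => $_->[1]\n" for @records;
```

The local keeps the changed separator confined to the block, so any later line-by-line reads in the same program still behave normally.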

-HomeBrew-

QUOTE (HomeBrew @ Sep 1 2006, 10:18 PM)
Okay -- try this:

Thanks a lot, HomeBrew. The code fetched the entire sequences, but the result looked something like this: the records follow each other without a line break before the '>' of the next accession number.
>ci0100130000
MPLEENISSSKRKPGSRGGVSFFSYFTQELTHGYFMDQNDARYTERRERVYTFLKQPREIEKVRPFPPFL
KEVTTPARLRSMRAPSVDHTVAAGTNLPSRNDDDVGDVDVLRHQAPDSVRSRKRHTATIVKATAIDEEI
H*>ci0100130001
MLPIVDFKQCRPSVEASDKEINETAKLLVDALSTVGFAYLKNCGIKKNCRRSQKHRG*MGGVRYLYYPPI
KGELELNQERLGEHSDYGSITLLFNQPLQFKGDADVPDPVKDPITALKYIQQKLSHTCQNT*>ci0100130003
MPPKKKKEVEKPPLILGRLGTSLKIGIVGLPNVGKSTFFNVLTKSEASAENFPFCTIDPNESRVPVPDER
KELGSESAVKSAGKYRQQGRNYIVEDGDIIFFKFNTPSQPKKK*>ci0100130005
CQICFETYTRPKSLNCQHTFCLKCLEEYTPPNSVRVICPTCRSEQPLTADGINGLKDNFFISSMSDMLKT

However, I want to say something at this point. In the course of this, there are some things that you have shown:

1. You are really good at helping people. The simple fact that you hear them out and spend time on the problem is heartening.

2. You have shown that learning Perl will be worth it, and fun too. For me to come up with such a script would have taken many, many hours of learning; in fact, understanding the script you have written will take me time. But this isn't going to stop me. I have a lot of other jobs to do, and I will go the Perl way for sure. Thanks a lot for the inspiration.

-kamesh-

Okay -- a quick fix:

Change the line:

print HIT ">$acc[$i]\n$seq";

to:

print HIT ">$acc[$i]\n$seq\n";

That should fix that problem. I just hope it doesn't introduce a blank line between some entries in the hit_list.txt file. If it does, let me know...
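
A minimal sketch of one way to guard against that. The cleanup substitution in the script only removes a newline when the character before it is not a word character, so a sequence ending in "*" loses its final newline while one ending in a letter may keep it. Assuming that is the case in the real data, stripping any trailing whitespace and then adding exactly one newline gives the same result either way. The helper name format_record is hypothetical and not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: format one record so that it always ends in exactly
# one newline, whatever trailing whitespace the sequence carried.
sub format_record {
    my ($name, $seq) = @_;
    $seq =~ s/\s+\z//;    # drop trailing newlines and spaces, if any
    return ">$name\n$seq\n";
}

# One sequence without a trailing newline, one with.
print format_record("ci0100130000", "MPLEEN\nH*");
print format_record("ci0100130005", "CQICFE\nMLKT\n");
```

In the script, the equivalent change would be to trim $seq before the print HIT line rather than relying on the newline being present or absent.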

-HomeBrew-

QUOTE (HomeBrew @ Sep 2 2006, 04:06 PM)
Okay -- a quick fix:

Change the line:

print HIT ">$acc[$i]\n$seq";

to:

print HIT ">$acc[$i]\n$seq\n";

That should fix that problem. I just hope it doesn't introduce a blank line between some entries in the hit_list.txt file. If it does, let me know...

It did, HomeBrew, it did. The problem has been fixed. Kudos for doing a wonderful job with the entire thing.

-kamesh-

Great -- so are we done tweaking?

I might carry it one step further, just to make it generic and applicable or adaptable for anyone who might trip over it on Google....
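
As a rough sketch of what a more generic version could look like (not the final tool, and not tested against a real database): it assumes the accession is the first whitespace-delimited word after ">", and it swaps the array scan for a hash lookup so that large accession lists stay fast. The sub name extract_records and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sub: takes open filehandles for the accession list and the
# database, and returns (\@hits, \@misses).
sub extract_records {
    my ($acc_fh, $db_fh) = @_;

    # Store the wanted accessions as hash keys; @order remembers list order.
    my (%wanted, @order);
    while (my $line = <$acc_fh>) {
        next unless $line =~ /^>(\S+)/;
        push @order, $1 unless $wanted{$1}++;
    }

    my @hits;
    local $/ = ">";    # read the database one record at a time
    while (my $chunk = <$db_fh>) {
        chomp $chunk;  # removes the trailing ">"
        my ($name, $seq) = $chunk =~ /^(\S+)[^\n]*\n(.+)/s or next;
        next unless delete $wanted{$name};    # skip records nobody asked for
        # Keep the sequence lines, dropping blank ones.
        my $body = join "\n", grep { /\S/ } split /\n/, $seq;
        push @hits, ">$name\n$body\n";
    }

    # Anything still in %wanted was never seen in the database.
    my @misses = grep { exists $wanted{$_} } @order;
    return (\@hits, \@misses);
}

# Made-up sample data, held in strings so the sketch runs without files.
my $acc_text = ">ci0100130001\n>ci0100130099\n";
my $db_text  = ">ci0100130000\nMPLEEN\nH*\n>ci0100130001\nMLPIVD\n\nQNT*\n";
open(my $acc_fh, '<', \$acc_text) or die "Can't open accession list: $!\n";
open(my $db_fh,  '<', \$db_text)  or die "Can't open database: $!\n";

my ($hits, $misses) = extract_records($acc_fh, $db_fh);
print @$hits;
print "missing: $_\n" for @$misses;
```

Passing filehandles rather than fixed file names means the same sub works on real files or on test strings.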

I encourage anyone who deals with huge files of sequence data (which virtually any life scientist has to do eventually) to learn Perl -- it is exquisitely adept at handling such text-based tasks, and can make otherwise daunting tasks quite easy.

-HomeBrew-

QUOTE (HomeBrew @ Sep 2 2006, 07:29 PM)
I might carry it one step further, just to make it generic and applicable or adaptable for anyone who might trip over it on Google....


Yes, a generic tool would be a fitting end to this. Way to go!

-kamesh-

QUOTE (kamesh @ Sep 3 2006, 06:27 AM)
QUOTE (HomeBrew @ Sep 2 2006, 07:29 PM)

I might carry it one step further, just to make it generic and applicable or adaptable for anyone who might trip over it on Google....


Yes, a generic tool would be a fitting end to this. Way to go!

hi homebrew,

Can you comment on each step of what the program does, or can you just write down the algorithm? I think it would be of help to me and to many other people looking over the posts.

Have a good day!

-kamesh-
