Protocol Online logo
Top : Forum Archives: : Bioinformatics and Biostatistics

script problems - extracting sequences (May/16/2008 )

Pages: 1 2 Next

hello,

i have a little problem when running this script:

CODE
#!/usr/bin/perl
# Script to extract sequences from file

open (MID, "mid1.query");

while (<MID>)
{
        chomp;
        $linea = $_;
        system("grep $linea mid1.fasta >> query.nr");
}
close MID;


i'm interested in extracting the sequences mentioned in mid1.query (it contains only the accession number) from the mid1.fasta into the query.nr file. the thing is, i only get the first line and not the sequence, something like:

>sequence1
>sequence2

instead of:

>sequence1
ATTTACTGGTAGGGCCA
>sequence2
TAGGGCCATTTACTGGTAGG

some of the sequences have one or two lines below of the sequence, could someone help me figure this out? thanks a lot.

-toejam-

nevermind, wrong command happy.gif

the right answer is (in case any of you ever need it):

CODE
#!/usr/bin/perl
# Script to extract sequences from file

open (MID, "mid1.query");

while (<MID>)
{
        chomp;
        $linea = $_;
        system("seqret mid1.fasta:$linea -stdout $linea.fasta;
cat $linea.fasta >> query.nr;
rm $linea.fasta;");

}
close MID;

-toejam-

Hi!
On the first place, I suggest you to have a look at bioperl code, since there are already some functions to do what you want and better, without having to write and test new code.
I am not too much in perl (I prefer python and biopython) so I can't tell you which are the modules to read fastas, but they shuold be something like SeqIO or similar.

The other thing is that instead of calling a system's function, you should use the perl's "grep" command, or better a regular expression:

CODE
if ($line =~ m/^>(.*)$/)
{
print $_;
}


Don't forget to make some tests for this code before using it, for example, see what happens if your fasta headers contains some ';' or if you read an empty file.

-dalloliogm-

hello dalloliogm.

i thought about using grep at first, but a problem is that some sequences in this database are in 1 line, others in 2, and some others more, but to call grep i'd need to use system command as well.
some emboss commands are really slow, it took almost 5 days to extract some 36,000 sequences from a 70,000 seq database blink.gif i'll also take a look at the bioperl code as suggested. thank you.

-toejam-

QUOTE (toejam @ May 27 2008, 07:56 AM)
hello dalloliogm.

i thought about using grep at first, but a problem is that some sequences in this database are in 1 line, others in 2, and some others more, but to call grep i'd need to use system command as well.
some emboss commands are really slow, it took almost 5 days to extract some 36,000 sequences from a 70,000 seq database blink.gif i'll also take a look at the bioperl code as suggested. thank you.



Hi,

There are a couple of things here:
1) I suggest would suggest that you avoid bioperl if you can, it is a heaving behemoth of f.ugly code that will reduce the portability of your code - biopython is somewhat neater and tidier than bioperl smile.gif
2) If you use system, you should do the following:
CODE
system("perl -e 'print q/hello world!/'") or die "I Failed for the following reason: $!\n"


3) If you don't care about the returning a status code use back-ticks to save yourself some typing:
CODE
#!/usr/bin/perl
use strict;
`echo I like perl\, but it doesn'\t like me`


4) Perl has it's own grep

CODE
#open a file as you have
my @complete_fasta_file = <F>;
my @ids = grep { /^>/ } @complete_fasta_file



If you want to get the sequences and the ids you can do this:

CODE
#!/usr/bin/perl
use warnings;
use strict;

open F, $ARGV[0] or die $!; #if the file is specified on the command line.
my %data = (); #A hash keys = id, value = seq

my $id = '';
my $seq = '';

while(<F>) {
    chomp;
   if ( /^>/ ) {
         $id = $_;
         next;
    }
    $data{$id} .= $_;
}

#Then you can loop over your hash and print the keys and sequence you want

-perlmunky-

hi
thanks for the suggestion perlm, at the end i sent it to the cluster at the institute biggrin.gif i didn't know perl had its own grep command. and once i worked with the arrays my computer went crazy, gotta be careful with those!
cheers.

-toejam-

If you, or anyone else for that matter, has a perl problem in the future, feel free to drop me a message or skype me.

blink.gif

-perlmunky-

QUOTE
1) I suggest would suggest that you avoid bioperl if you can, it is a heaving behemoth of f.ugly code that will reduce the portability of your code - biopython is somewhat neater and tidier than bioperl smile.gif
Can you argue this?
I don't have used bioperl too much, but some functions are very good to use.
For example, I have used Bio::AlignIO to print an alignment in different formats, and it worked very well.
It was very useful: just by changing a variable, I could print the alignment in a different format.
Also, code from bioperl is already tested and kept up-to-date, which is not a thing to underestimate.
I can assure you that biopython is smaller than bioperl, as it misses a lot of functions and modules.

-dalloliogm-

Yes, I can argue this point.

Perl is portable, to some degree - I can take a piece of code I write and run it on your machine. If however I used bioperl I would have to then install all my modules on your machine - this can be really annoying if you don't have root on your machine - especially if you are coding for multiple people at multiple locations using multiple machines.

With the sequence stuff you are touching the tip of the iceberg - bioperl is a bloated monster - there is a lot to it.

I know biopython is small - I can and do code in several languages many of which have bio* projects - bioperl is the most mature ... there are large groups of people that develop it at places like the sanger and ebi in the uk.

My preference is to write my own stuff.

-perlmunky-

QUOTE (perlmunky @ May 29 2008, 06:20 PM)
Perl is portable, to some degree - I can take a piece of code I write and run it on your machine. If however I used bioperl I would have to then install all my modules on your machine - this can be really annoying if you don't have root on your machine - especially if you are coding for multiple people at multiple locations using multiple machines.

With the sequence stuff you are touching the tip of the iceberg - bioperl is a bloated monster - there is a lot to it.

That's ok, even if I don't agree to everything you are saying.

The good thing of perl is that you can easily install modules from cpan - you don't need to have root privileges and it only takes to write a command on the command line.
Really, I think perl is an awful programming language for bioinformaticists, I would never have used it if it were not for all the modules in cpan, bioperl and ensembl smile.gif.

-dalloliogm-

Pages: 1 2 Next