Database filtering - (Sep/03/2007)

Hello everyone,

Does anybody know how to filter databases? I have databases of maize sequences in FASTA format, but I'm only interested in the sequences from endosperm. The format of the databases is:

>id acc
sequence

How could I extract only those that mention endosperm in the id? Thank you for your help.

-rodpck-

QUOTE (rodpck @ Sep 3 2007, 11:58 AM)
how could i extract only those that mention endosperm in the id? thank you for your help.


Use Perl. Something like the script below should work ... I have not tested it, and I am a bit frazzled from thesis writing, so it may not work.

CODE
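#!/usr/bin/perl

use warnings;
use strict;

# The FASTA file is passed as the first command-line argument;
# matching records are written to standard output.
my $fileName = $ARGV[0] or die "I need the location and name of the FASTA file\n";
open my $fh, '<', $fileName or die "Cannot open $fileName: $!\n";

# True while we are inside a record whose header mentions endosperm.
my $keep = 0;
while (my $line = <$fh>) {

    # A header line starts a new record; decide whether to keep it.
    # No /g flag here: in a repeated scalar match it would skip every other hit.
    if ( $line =~ /^>/ ) {
        $keep = ( $line =~ /endosperm/i ) ? 1 : 0;
    }

    # Print the header and every sequence line of a kept record, in file order,
    # so multi-line sequences are not truncated.
    print $line if $keep;
}

close $fh;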
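
If you need to filter on some other word later, the same idea can be wrapped in a subroutine. This is only a minimal sketch: the name filter_fasta and the sample records are made up for illustration, and it assumes every record starts with a ">" header line. It reads from an in-memory string so it can be tried without a real database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): return the FASTA records
# read from $fh whose header line contains $keyword, case-insensitively.
sub filter_fasta {
    my ($fh, $keyword) = @_;
    my @records;      # kept records, in file order
    my $keep = 0;     # true while inside a matching record
    while (my $line = <$fh>) {
        if ($line =~ /^>/) {
            # \Q...\E treats the keyword as literal text, not as a regex.
            $keep = ($line =~ /\Q$keyword\E/i) ? 1 : 0;
            push @records, '' if $keep;
        }
        # Append header and sequence lines to the record currently being kept.
        $records[-1] .= $line if $keep;
    }
    return @records;
}

# Made-up sample data, for illustration only.
my $sample = <<'END';
>seq1 endosperm cDNA
ACGTACGT
GGCC
>seq2 leaf cDNA
TTTTAAAA
>seq3 Endosperm EST
CCGG
END

# Open the string as if it were a file.
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!\n";
my @hits = filter_fasta($in, 'endosperm');
close $in;

print @hits;
```

With the sample above this prints the seq1 and seq3 records and skips seq2. To use it on a real file, open the file instead of the string and pass that handle to filter_fasta.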

-perlmunky-

Hi perlmunky,

The script worked really nicely. Thank you very much. I hope you finish your PhD thesis soon; it's a pain in the ass. I am sure my institute needs good bioinformatics people, just in case you are interested in moving to México... hahaha.

Thanks again.

-rodpck-