Protocol Online logo
Top : Forum Archives: : Bioinformatics and Biostatistics

Perl script - parsing non-phosphorylated protein sequences from a Uniprot flat f - (Jul/21/2006 )

Pages: 1 2 Next

Hi there

I have a set of data that contains human sequences like these (obtained from Uniprot):

ID 1433B_HUMAN STANDARD; PRT; 245 AA.
AC P31946;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Alternative initiation;
KW Direct protein sequencing; Phosphorylation.
>1433B_HUMAN
TMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSW
RVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYL
KMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYY
EILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDA
GEGEN
ID 1433E_HUMAN STANDARD; PRT; 255 AA.
AC P62258; P29360; P42655; Q63631;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Direct protein sequencing;
KW Host-virus interaction.
>1433E_HUMAN
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
ID 1433F_HUMAN STANDARD; PRT; 245 AA.
AC Q04917;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Direct protein sequencing; Phosphorylation.
>1433F_HUMAN
GDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWR
VISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKV
FYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSV
FYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEE
AGEGN

I only want to retrieve those sequences and accession numbers that are non-phosphorylated (i.e. those that don't contain the keywords "Phosphorylation".

I have created a perl script like this:

!usr/bin/perl

$infile = 'human.txt';

unless(open (INFILE, $infile))

{


print STDERR "Cannot open file \"$infile\"\n\n";

exit;
}

@infile = <INFILE>;

close INFILE;

open (OUTPUTFILE, ">C:\temp.txt");

my $line="";


while (<INFILE>){

if($line=~ /^ID/){
}elsif($line=~ /^AC/){
}elsif($line=~/^OS/){
}elsif(($line =~ /\w/) && (not $line =~ /^Phosphorylation/)){
print OUTPUTFILE "$line";}}


close OUTPUTFILE;

exit;


BUT nothing seems to happen sad.gif

Is there a problem in my script? I think there is but can't seem to figure it out. I would really appreciate any help or suggestions.

Thank you in advance

Sara

-sara.pl-

Hi Sara --

Try it this way:

CODE
#!/usr/bin/perl -w
use strict;

$/ = "\nID ";

open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";

while (<IN>){
    next if (/Phosphorylation/);
    chomp;
    print OUT "ID $_\n";
}

print "Done.\n";


This does nothing to alter the format of the records contained in the output file (parsed.txt), except insure they don't have the word "Phosphorylation" in them.

Do you need to reformat the output (perhaps strip out uneeded headrs or something)?

-HomeBrew-

QUOTE (HomeBrew @ Jul 21 2006, 05:55 PM)
Hi Sara --

Try it this way:

#!/usr/bin/perl -w
use strict;

$/ = "\nID ";

open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";

while (<IN>){
next if (/Phosphorylation/);
chomp;
print OUT "ID $_\n";
}

print "Done.\n";

This does nothing to alter the format of the records contained in the output file (parsed.txt), except insure they don't have the word "Phosphorylation" in them.

Do you need to reformat the output (perhaps strip out uneeded headrs or something)?



Hi Homebrew,

The script worked very well. Thank you! biggrin.gif

This is an output example where non-phosphorylated sequences have been found (i.e all those containing the Keyword "phosphorylation" have been removed leaving me with non-phosphorylated sets:

ID 1433E_HUMAN STANDARD; PRT; 255 AA.
AC P62258; P29360; P42655; Q63631;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Direct protein sequencing;
KW Host-virus interaction.
>1433E_HUMAN
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
ID 1433T_HUMAN STANDARD; PRT; 245 AA.
AC P27348; Q567U5; Q5TZU8; Q9UP48;
OS Homo sapiens (Human).
KW 3D-structure; Direct protein sequencing.
>1433T_HUMAN
MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSVAYKNVVGGRRSAWR
VISSIEQKTDTSDKKLQLIKDYREKVESELRSICTTVLELLDKYLIANATNPESKVFYLK
MKGDYFRYLAEVACGDDRKQTIDNSQGAYQEAFDISKKEMQPTHPIRLGLALNFSVFYYE
ILNNPELACTLAKTAFDEAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDSAGEECDAA
EGAEN
ID 15E2_HUMAN STANDARD; PRT; 136 AA.
AC O43716;
OS Homo sapiens (Human).
KW Hypothetical protein.
>15E2_HUMAN
MWSRLVWLGLRAPLGGRQGFTSKADPQGSGRITAAVIEHLERLALVDFGSREAVARLEKA
IAFADRLRAVDTDGVEPMESVLEDRCLYLRSDNVVEGNCADELLQNSHRVVEEYFVAPPG
NISLPKLDEQEPFPHS

Would you know how I can put the accession numbers in place of ID's in the line containing ">"?

Take for example the last entry 15E2_HUMAN - could this be formatted like this since I only want sequence and accession number in my output file:-

>O43716
MWSRLVWLGLRAPLGGRQGFTSKADPQGSGRITAAVIEHLERLALVDFGSREAVARLEKA
IAFADRLRAVDTDGVEPMESVLEDRCLYLRSDNVVEGNCADELLQNSHRVVEEYFVAPPG
NISLPKLDEQEPFPHS

where O43716 is the accession number.

Thank you very much once again.....You've been a big help to my project! smile.gif

Sara

-sara.pl-

Sure, that should be an easy tweak. How do you want to handle records with more than one accession number (see ID 1433T_HUMAN STANDARD, for example)? We could retain just the first one, thus:

>P27348
MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSVAYKNVVGGRRSAWR
VISSIEQKTDTSDKKLQLIKDYREKVESELRSICTTVLELLDKYLIANATNPESKVFYLK
MKGDYFRYLAEVACGDDRKQTIDNSQGAYQEAFDISKKEMQPTHPIRLGLALNFSVFYYE
ILNNPELACTLAKTAFDEAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDSAGEECDAA

or retain them all, thus:

>P27348 Q567U5 Q5TZU8 Q9UP48
MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSVAYKNVVGGRRSAWR
VISSIEQKTDTSDKKLQLIKDYREKVESELRSICTTVLELLDKYLIANATNPESKVFYLK
MKGDYFRYLAEVACGDDRKQTIDNSQGAYQEAFDISKKEMQPTHPIRLGLALNFSVFYYE
ILNNPELACTLAKTAFDEAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDSAGEECDAA

Also, is there any other info you'd like to include on the description line (AA count, perhaps)?

-HomeBrew-

Here's an example:

CODE
#!/usr/bin/perl -w
use strict;

$/ = "\nID ";

open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";

while (<IN>) {
    next if (/Phosphorylation/);
    chomp;
    my ($aa) = (/PRT;\s+(\d+)\s+AA\./);
    my ($seq) = (/>.*?\n(.*)/ms);
    my ($accline) = (/AC(?:\s+(.*);)+/);
    my @acc = split /;/, $accline;
    print OUT ">" . join (",", @acc) . " ($aa AA)\n$seq\n"
}

print "Done.\n";


An example of the output is:

>P62258, P29360, P42655, Q63631 (255 AA)
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ

-HomeBrew-

An example of the output is:

>P62258, P29360, P42655, Q63631 (255 AA)
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
[/quote]

Hi Homebrew,

Thanks for getting back to me. Is it possible to just retain first the accession number as I'm going to be using a clustering algorithm (blastclust) to cluster the sequences at 25% identity, hence so will just need one accession number.

But, I'd also like to add a random decimal number after the accessions as blastclust doesn't recognise accessions as unique identifiers so it requires a number after the accessions.

For example:

>P62258-::-0.001
sequence
>O43716-::-0.002

etc

The -::- separates the accession from the random number for blastclust to work properly. I will then write a perl script to only access the accession and remove the random number.

Thank you very much. smile.gif

Sara

-sara.pl-

Hey, I recognise that separator!!!!

I can send you the script I used to make it in the first place... all you need to do is implement the perl rand function and add that to the id line.

-perlmunky-

When you said "random number", did you really mean "unique number"? The former requires generating a random number between a particular range (not hard, but different), the latter is more straightforward. I suspect you mean "unique number". If that's so, here's one way of doing it:

CODE
#!/usr/bin/perl -w
use strict;

$/ = "\nID ";

my $count;

open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";

while (<IN>) {
    next if (/Phosphorylation/);
    chomp;
    $count++;
    my ($seq) = (/>.*?\n(.*)/ms);
    my ($acc) = (/AC\s+(.*?);/);
    print OUT ">$acc-::-" . $count/1000 . "\n$seq\n"
}

print "Done.\n";


Example output:

>P62258-::-0.001
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ

So, then you'll have your blastclust-compatible fasta formatted file of non-phosphorylated proteins. Next, you want to pass this to blastclust? So we're looking at something like:

CODE
system("blastclust -i parsed.txt -o bc_out.txt -S 25");

-HomeBrew-

Hi Homebrew,

Yes, what I meant was unique identifier. I will try out your perl script over the weekend and then blastclust the sequences on a Unix platform on Monday. (Windows is a bit problematic when it comes to blastclust rolleyes.gif )

The command used to execute the algorithm is:

blastclust -i parsed.txt -o clusters.txt -p T -L .9 -b T -S 25

where:
-i = the input fasta file called "pasred.txt"
-o = the output file called "clusters.txt"
-p T = indicates input file contains protein sequences and not nucleotide sequnces
-L = area covering 90% of the length of sequence
-S = I want sequences which have 25% similarity

This will cluster the sequences accordingly and print the clusters with accession numbers and the unique number to the output file.

To obtain the accession no's only (leaving behind the unique number), I created the following script:

[code]
#!/usr/bin/perl

$clustfile = 'clusters.txt';

unless(open (CLUSTFILE, $clustfile))
{
print STDERR "Cannot open file \"$clustfile\"\n\n";

exit;

}

@clustfile = <CLUSTFILE>;

close CLUSTFILE;

open (OUTPUTFILE, ">C:\accession.txt");

my $line ="";

foreach $line (@clustfile)
{

if ($line =~ /(>[OPQ].{5})/ig)

{

print OUTPUTFILE "$1\n";}}

close OUTPUTFILE;

exit;


It prints all the appropriate accession numbers at 25% identity on a new line.

Thanks once again for your help. I shall let you know how it goes.

Have a nice weekend biggrin.gif

Sara

-sara.pl-

My understanding is that -p, -L, and -b default to T, 0.9, and T, respectively, therefore there's no reason to explicitly set them, since you're using the default values, and thus all you need set is -S 25.

Your script will fail as written, generating a bunch of "print() on closed filehandle OUTPUTFILE at script_name.pl line 28." errors.

This is because you have the pathname wrong in the line:

CODE
open (OUTPUTFILE, ">C:\accession.txt");


and you've not checked for the error (contained in $!). This file is (silently) never opened, thus each time you call:

CODE
print OUTPUTFILE "$1\n";


Perl will complain about trying to print to a closed (i.e. never opened) filehandle.

Here's the deal on Windows pathnames in Perl:

">C:\accession.txt" # fail
'>C:\accession.txt' # okay
">C:/accession.txt" # okay
'>C:/accession.txt' # okay
">C:\\accession.txt" # okay
'>C:\\accession.txt' # okay


I often begin my windows-destined scripts with something like:

CODE
my $root = "$ENV{SYSTEMDRIVE}";
my $desktop = "$ENV{USERPROFILE}\\Desktop";
my $temp = "$ENV{TEMP}";


so my three most frequently written to directories are predefined, then I use them like:

CODE
open (OUT, ">$desktop/accession.txt") or die "Cannot open file $desktop/accession.txt: $!\n";


Using the predefined environmental variables also makes the script machine-independent -- I haven't hard-coded the username into the script.

Your script looks okay otherwise, except that it will fail if the accession number begins with anything other than O, P, or Q, and you haven't taken advantage of the neat '-::-' we used. I'm also not a big fan of assigning constants to variables (as in $clustfile = 'clusters.txt';).

It's unlikely to be a problem, but you might want to do a line-by-line read rather than slurp the whole file into an array, which might be trouble if we're doing genome-sized stuff with huge files.

I notice you're not using strict or enabling warnings, both of which you should always, always, always do. Did I mention you should always enable strict and warnings? biggrin.gif

But, save for the pathname problem, your script will function correctly, as far as I can tell.

I would do it this way:

CODE
#!/usr/bin/perl -w
use strict;

open (IN, "clusters.txt") or die "Cannot open file clusters.txt: $!\n";
open (OUT, ">C:/accession.txt") or die "Cannot open file accession.txt: $!\n";

while (<IN>) {
    print OUT "$1\n" if (/^>(.*)-::-/);
}


Is the problem with blastclust one of blastclust on Windows, or Perl on Windows?

-HomeBrew-

Pages: 1 2 Next