Identifying common accession numbers among a set of files - (Sep/08/2006)

Hi all, I am a newbie to Perl programming, and I have my work cut out for me. I have a set of text files (accno1.txt, accno2.txt, accno3.txt ... accno'n'.txt), each containing a list of accession numbers. For example:

accno1.txt may contain:

>AC1123413009
>AC1123430574
>AC1123430090
>AC1123430804
>AC1123430945
>AC1123440986
>AC1123430090

accno2.txt may contain:

>AC1123430100
>AC1123430347
>AC1123430090
>AC1123430903
>AC1123430945
>AC1123440986
>AC1123430090

accno3.txt

>AC1123440100
>AC1123450320
>AC1123460347
>AC1123450090
>AC1123430903
>AC1123439745
>AC1123440986

accno'n'.txt
......
........

I would like a program that writes an output file telling me which accession numbers are shared between files, like this:

output file.txt

>AC1123430090 - is common in accno1.txt and accno2.txt ,
>AC1123440986 - is common in accno1.txt, accno2.txt, accno3.txt,
.........

I have been trying Perl scripting but have been going nowhere. Any help is welcome.

-perlnovice-

Hello.

As you have been trying Perl, I will give you a hint rather than a coded solution!
The first thing to do, if you like Perl, is get yourself a copy of the Perl Cookbook. There is a solution to this problem right there. I don't have my copy with me to give you a page reference because I am not at my desk.

What you could do is build a hash of arrays. Each array should be as long as the number of files you have. Read through each file; the key to the hash should be the accession number.

$hash{'>AC1123413009'} = [0,0,0,0,0];

If you see that accession number in file 0, then:

$hash{'>AC1123413009'} = [1,0,0,0,0];

So at the end you have a binary array you can inspect to see which files contain that accession number.
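A minimal sketch of this presence-array idea, assuming the lists are already in memory. The `%lists` data and the `presence_table` name are made up for illustration; real code would read the accno*.txt files instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a hash of arrays: accession number => one 0/1 flag per file.
# $lists is a hash ref of file name => array ref of accession numbers
# (a hypothetical in-memory stand-in for reading the files).
sub presence_table {
    my ($lists) = @_;
    my @files = sort keys %$lists;
    my %hash;
    for my $i (0 .. $#files) {
        for my $acc (@{ $lists->{ $files[$i] } }) {
            # First sighting: start with an all-zero array, one slot per file.
            $hash{$acc} ||= [ (0) x @files ];
            $hash{$acc}[$i] = 1;
        }
    }
    return (\@files, \%hash);
}

# A few accession numbers taken from the sample files in the question.
my %lists = (
    'accno1.txt' => [ '>AC1123413009', '>AC1123440986', '>AC1123430090' ],
    'accno2.txt' => [ '>AC1123430100', '>AC1123440986', '>AC1123430090' ],
    'accno3.txt' => [ '>AC1123440100', '>AC1123440986' ],
);

my ($files, $hash) = presence_table(\%lists);
for my $acc (sort keys %$hash) {
    my @flags = @{ $hash->{$acc} };
    # Translate the 1 flags back into file names.
    my @where = map { $files->[$_] } grep { $flags[$_] } 0 .. $#flags;
    print "$acc [@flags] found in: @where\n";
}
```

Note that a 0/1 flag only records presence, so a number repeated inside one file looks the same as one that appears there once.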

There is probably a better way but I haven't been coding for several weeks.

-perlmunky-

This is indeed a bit tricky, mostly because (according to your sample data) an accession number can appear twice in the same file.

So we actually have four situations:

The accession number is unique, it appears just once total.
The accession number is not unique because it appears more than once in a single file.
The accession number is not unique because it appears once in each of two or more files.
The accession number is not unique because it appears once or more than once in two or more files.
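These four cases can be told apart from per-file counts alone. A rough sketch, assuming the counts for one accession number are held in a hash of file name => times seen (the `classify` name and its return labels are invented here for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# $counts: hash ref of file name => number of times ONE accession
# number was seen in that file (only files where it occurs at all).
sub classify {
    my ($counts) = @_;
    my $nfiles = keys %$counts;          # how many files contain it
    my $most   = max(values %$counts);   # highest count in any one file
    return 'unique'                      if $nfiles == 1 && $most == 1;
    return 'repeated within one file'    if $nfiles == 1;
    return 'once each in several files'  if $most == 1;
    return 'several files, repeated in at least one';
}

# The first three entries mirror the sample data in the question;
# the last one is a hypothetical case the sample does not contain.
my %examples = (
    '>AC1123413009' => { 'accno1.txt' => 1 },
    '>AC1123440986' => { 'accno1.txt' => 1, 'accno2.txt' => 1, 'accno3.txt' => 1 },
    '>AC1123430090' => { 'accno1.txt' => 2, 'accno2.txt' => 2 },
    '>HYPOTHETICAL' => { 'accno1.txt' => 3 },
);

for my $acc (sort keys %examples) {
    print "$acc: ", classify($examples{$acc}), "\n";
}
```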

Here's one way to do it:

CODE

#!/usr/bin/perl
use strict;
use warnings;
use Cwd;

# %data maps accession number => { file name => times seen in that file }
my %data;
my @files = <accno*.txt>;

die "Error: no files matching accno*.txt found in " . getcwd() . ".\n" if (@files == 0);

open (my $one, '>', 'unique.txt') or die "Can't open unique.txt: $!\n";
open (my $multi, '>', 'multiple.txt') or die "Can't open multiple.txt: $!\n";

# Count every accession number, per file.
foreach my $file (@files) {
    open (my $in, '<', $file) or die "Can't open $file: $!\n";
    while (<$in>) {
        chomp;
        s/\s+//g;            # strip stray whitespace (including \r)
        next if $_ eq '';    # skip blank lines so they are not counted
        $data{$_}{$file}++;
    }
    close ($in);
}

foreach my $acc (sort keys %data) {
    if (keys %{$data{$acc}} > 1) {
        # Seen in two or more files: list each file with its count.
        print $multi "$acc appears in:\n";
        foreach my $file (sort keys %{$data{$acc}}) {
            print $multi "\t$file ($data{$acc}{$file})\n";
        }
        print $multi "\n";
    } else {
        # Seen in exactly one file: repeated there, or truly unique.
        foreach my $file (sort keys %{$data{$acc}}) {
            if ($data{$acc}{$file} > 1) {
                print $multi "$acc appears in:\n\t$file ($data{$acc}{$file})\n\n";
            } else {
                print $one "$acc\n";
            }
        }
    }
}

close ($one);
close ($multi);

print "Done -- processed " . @files . " files.\n";

Let me know if this does what you need...
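If the one-line format from the original question is preferred, something along these lines might work. It is only a sketch: it assumes a structure shaped like %data above (accession number => { file name => count }), and the `common_lines` name and the tiny sample are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one "ACC - is common in file, file, ..." line for every
# accession number that occurs in two or more files.
sub common_lines {
    my ($data) = @_;
    my @lines;
    for my $acc (sort keys %$data) {
        my @files = sort keys %{ $data->{$acc} };
        next unless @files > 1;    # skip numbers confined to one file
        push @lines, "$acc - is common in " . join(', ', @files);
    }
    return @lines;
}

# Small in-memory sample shaped like %data.
my %sample = (
    '>AC1123413009' => { 'accno1.txt' => 1 },
    '>AC1123430090' => { 'accno1.txt' => 2, 'accno2.txt' => 2 },
    '>AC1123440986' => { 'accno1.txt' => 1, 'accno2.txt' => 1, 'accno3.txt' => 1 },
);

print "$_\n" for common_lines(\%sample);
```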

-HomeBrew-

hi perlnovice & homebrew...

that was an interesting post...

Once again, if you can, please try adding comments to the Perl code.

have a nice day

-kamesh-

Hi homebrew, that was brilliant...thanks a lot.

I am happy with the multiple.txt output format.

The unique.txt file looks like this:
>AC1123413009
>AC1123430100
>AC1123430347
>AC1123430574
>AC1123430804
>AC1123439745
>AC1123440100
>AC1123450090
>AC1123450320
>AC1123460347

Can the accession numbers in unique.txt be grouped by their source file, as in:

accno1.txt
>AC1123413009
>AC1123430574
>AC1123430804
accno2.txt
>AC1123430347
>AC1123430100
accno3.txt
>AC1123439745
>AC1123440100
>AC1123450090
>AC1123450320
>AC1123460347

thanks once again

-perlnovice-

Sure, perlnovice -- all things are possible...

Try this:

CODE

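#!/usr/bin/perl
use strict;
use warnings;
use Cwd;

# %data:   accession number => { file name => times seen in that file }
# %unique: file name => [ accession numbers found only there, once ]
my (%data, %unique);

my @files = <accno*.txt>;

die "Error: no files matching accno*.txt found in " . getcwd() . ".\n" if (@files == 0);

open (my $one, '>', 'unique.txt') or die "Can't open unique.txt: $!\n";
open (my $multi, '>', 'multiple.txt') or die "Can't open multiple.txt: $!\n";

# Count every accession number, per file.
foreach my $file (@files) {
    open (my $in, '<', $file) or die "Can't open $file: $!\n";
    while (<$in>) {
        chomp;
        s/\s+//g;            # strip stray whitespace (including \r)
        next if $_ eq '';    # skip blank lines so they are not counted
        $data{$_}{$file}++;
    }
    close ($in);
}

foreach my $acc (sort keys %data) {
    if (keys %{$data{$acc}} > 1) {
        # Seen in two or more files: list each file with its count.
        print $multi "$acc appears in:\n";
        foreach my $file (sort keys %{$data{$acc}}) {
            print $multi "\t$file ($data{$acc}{$file})\n";
        }
        print $multi "\n";
    } else {
        # Seen in exactly one file: repeated there, or truly unique.
        foreach my $file (sort keys %{$data{$acc}}) {
            if ($data{$acc}{$file} > 1) {
                print $multi "$acc appears in:\n\t$file ($data{$acc}{$file})\n\n";
            } else {
                # Collect unique numbers under their source file.
                push @{$unique{$file}}, $acc;
            }
        }
    }
}

# Write each file name followed by its unique accession numbers.
foreach my $key (sort keys %unique) {
    print $one "$key\n", join ("\n", @{$unique{$key}}), "\n";
}

close ($one);
close ($multi);

print "Done -- processed " . @files . " files.\n";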

Better?

-HomeBrew-

Perhaps anticipating your next question...

If you'd like the accession numbers listed under each file name in the unique output file to be sorted as well, replace the "foreach my $key (sort keys %unique)" loop above with the following. It assumes ONE is the output handle for unique.txt, as opened in the script.

CODE

foreach my $key (sort keys %unique) {
    @{$unique{$key}} = sort @{$unique{$key}};
    print ONE "$key\n", join ("\n", @{$unique{$key}}), "\n";
}

-HomeBrew-

QUOTE (HomeBrew @ Sep 8 2006, 04:03 PM)

Thanks, HomeBrew. It worked, and you saved my day.

-perlnovice-

No problem, perlnovice -- glad to help. This is one of those needs that when done by hand could take hours and is prone to errors. Perl can do it for you errorlessly in less than a second and with a lot less effort...

-HomeBrew-