
Working with a table - Check that duplicate samples have the same alleles (May/18/2005)

Dear All,
I am a Perl beginner.
I could not solve this small problem. I am trying to work with a file exported from GeneMapper. I would appreciate your help.

The input file for the script is 1.txt.
The script I have written so far is below.
The objectives are:
1. If GQ is not 1, change the Allele1 and Allele2 columns to nil. (I have managed this one in the script.)
2. Check whether rows with the same sample (UD1) and marker name have the same Allele1 and Allele2. (I think I should use a hash but cannot work out how.)
3. Give an output file, formatted by UD1, without any duplicated sample & marker names. Format the marker names in a column, i.e. see 2.exl.
Thank you
Jin


==========Script =====================
$GenemapperFile ="c:/1.txt";
open (FILE, "<$GenemapperFile") or die "Unable to open the file $GenemapperFile;$!";
$tableheader=readline(FILE);
while (<FILE>)
{
$size = length $_; # the current line is in $_ ($line was never set)
#print "The size of each line is $size\n"; # output size of line
@x=split(/\t/);
push @SampleFile, $x[0];
push @SampleName, $x[1];
push @SampleID, $x[2];
push @RunName, $x[3];
push @Panel, $x[4];
push @Marker, $x[5];
push @Dye, $x[6];
push @SNP, $x[7];
push @Allele1, $x[8];
push @Allele2, $x[9];
push @Size1, $x[10];
push @Size2, $x[11];
push @Height1, $x[12];
push @Height2, $x[13];
push @PeakArea1, $x[14];
push @PeakArea2, $x[15];
push @DataPoint1, $x[16];
push @DataPoint2, $x[17];
push @Mutation1, $x[18];
push @Mutation2, $x[19];
push @AEComment1, $x[20];
push @AEComment2, $x[21];
push @ADO, $x[22];
push @AE, $x[23];
push @OMIT, $x[24];
push @OS, $x[25];
push @SHP, $x[26];
push @OBA, $x[27];
push @SPA, $x[28];
push @SP, $x[29];
push @BIN, $x[30];
push @PHR, $x[31];
push @LPH, $x[32];
push @SPU, $x[33];
push @AN, $x[34];
push @BD, $x[35];
push @DP, $x[36];
push @NB, $x[37];
push @CC, $x[38];
push @OVL, $x[39];
push @XTLK, $x[40];
push @GQ, $x[41];
push @UD1, $x[42];
push @UD2, $x[43];
push @UD3, $x[44];
push @CV, $x[45];

}
close FILE;


#@Matrix=(\@SampleFile,\@SampleName,\@SampleID,\@RunName,\@Panel, \@Marker,\@Dye,\@SNP,\@Allele1,\@Allele2,\@Size1,\@Size2,\@Height1, \@Height2,\@PeakArea1,\@PeakArea2,\@DataPoint1,\@DataPoint2,\ @Mutation1,\@Mutation2,\@AEComment1,\@AEComment2,\@ADO, \@AE,\@OMIT,\@OS,\@SHP,\@OBA,\@SPA,\@SP,\@BIN,\@PHR,\@LPH, \@SPU,\@AN,\@BD,\@DP,\@NB,\@CC,\@OVL,\@XTLK,\@GQ,\@UD1,\@UD2, \@UD3,\@CV);
#print "test4 @Matrix\n";


#$MatrixRef=\@Matrix;
#print $MatrixRef;

for ($i=0;$i<@GQ;++$i)
{
if ($GQ[$i]!=1)
{
$Allele1[$i]="";
$Allele2[$i]="";

}
#Test the GQ not =1 the genotype should be nil
# print "$UD1[$i]\t"."$Allele1[$i]\t"."$Allele2[$i]\n";

# create a array to have sample name and marker as well
$UD1marker=$UD1[$i]."-".$Marker[$i];
push @UD1Marker, $UD1marker ;
# create a hash: key: sample name and marker ; value: allele1 and allele2
$UD1MarkerAllele1{$UD1Marker[$i]}=$Allele1[$i];
$UD1MarkerAllele2{$UD1Marker[$i]}=$Allele2[$i];


}

-jlow-

Hi,

First, your sample output file is missing.

The key to your problem is to use the header row info for data storage.

I would use an array of hashes to store the data (there seems to be no unique identifier for each row, so a single hash keyed by row cannot be used). Here is the semi-pseudocode:

CODE
my $is_head = 0;
my @head;
my @data; # array of hash references
while (<FILE>)
{
    chomp;
    if ($is_head == 0) {
        # read the header row
        @head = split(/\t/);
        $is_head = 1; # header is in, the rest are data
        next;         # do not store the header row as data
    }
    # read the rest; -1 keeps trailing empty fields
    my @line = split(/\t/, $_, -1);
    my $hash = {}; # temp hash ref for this row
    for (my $i = 0; $i < @head; $i++) {
        $hash->{$head[$i]} = $line[$i]; # assign title to each piece of data
    }
    push @data, $hash; # store data in array of hashes
}

# then you can manipulate the data
for my $line (@data) {
    if ($line->{GQ} ne 1) {
        $line->{Allele1} = '';
        $line->{Allele2} = '';
    }
    # I don't quite understand your 2nd question but it's easier to do any manipulation once the data structure is right
}

Hope that helps.

-sage-
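
A minimal runnable sketch of the header-row idea above, assuming a tab-delimited export whose columns include Marker, Allele1, Allele2, GQ and UD1 as in the posted script. The in-memory sample lines and the helper name read_rows are made up for illustration and are not the real 1.txt; a hash slice assigns every field to its column name in one step.

```perl
use strict;
use warnings;

# Hypothetical helper: turn tab-delimited lines (header first) into an
# array of hash references keyed by column name.
sub read_rows {
    my @lines = @_;
    chomp @lines;
    my @head = split /\t/, shift @lines;
    my @rows;
    for my $line (@lines) {
        my %row;
        # Hash slice: one assignment maps all fields to their column
        # names; the -1 limit keeps trailing empty fields.
        @row{@head} = split /\t/, $line, -1;
        push @rows, \%row;
    }
    return @rows;
}

# Made-up sample data for illustration only
my @rows = read_rows(
    "Marker\tAllele1\tAllele2\tGQ\tUD1\n",
    "BM4045.3FR\t112\t112\t1\tC3\n",
    "BMS527.3FR\t177\t177\t0\tC3\n",
);

# Objective 1: blank both alleles when GQ is not 1
for my $row (@rows) {
    next if $row->{GQ} eq '1';
    $row->{Allele1} = '';
    $row->{Allele2} = '';
}

print join("\t", @{$_}{qw(UD1 Marker Allele1 Allele2)}), "\n" for @rows;
```

With the real file, the lines would come from the FILE handle instead of the literal list, and the 46 parallel arrays would no longer be needed.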

Hi Sage,
Thank you for your help. It is helpful.
However, I still could not understand how to work with 2 or more rows that have the same UD1 and Marker name. See 1.txt.
Thanks
Jin

-jlow-

QUOTE
Check whether rows with the same sample (UD1) and marker name have the same Allele1 and Allele2. (I think I should use a hash but cannot work out how.)


I don't quite understand this one, can you explain it in more detail?

-sage-

I have just changed the input file for the script to make it simpler. For any cell of the table that is not manipulated, I have put in “NA”.
Flow chart:
· Input file for the script: 1.txt
· The PERL script
· Output file 1 from the script: GQfile.txt
· Output file 2 from the script: CheckDuplicateResult.txt
· Output file 3 from the script: Format.txt

Details:
· The input file 1.txt is attached here.
· The script I have written so far is below.
· I have done output file 1 from the script, GQfile.txt (see the attachment). Its purpose: wherever GQ is not 1, change the values in the Allele1 and Allele2 columns to nil (“”).
· CheckDuplicateResult.txt is to check whether rows with the same sample (UD1) and marker name have the same Allele1 and Allele2. (I think I should use a hash but cannot work out how.)
Marker Allele 1 Allele 2 UD1 UD2 UD3 CV
BM4045.3FR 112 112 C3 Dup1 NA
BM4045.3FR 112 112 C3 Dup2 NA
BM4045.3FR 112 112 C3 Dup3 NA
BMS527.3FR 177 177 C3 Dup1 NA
BMS527.3FR 177 177 C3 Dup2 NA
BMS527.3FR 177 177 C3 Dup3 NA

CheckDuplicateResult.txt should give the output as
Marker Allele 1 Allele 2 UD1 UD2 UD3 NewCol
BM4045.3FR 112 112 C3 Dup2 & Dup3 NA Dup1 is different: Allele2 is 112
BMS527.3FR 177 177 C3 Dup1&Dup2 & Dup3 NA



3. Format.txt: We'll discuss it later.
Thank you
Jin

==========Script =====================
$GenemapperFile ="c:/1.txt";
open (FILE, "<$GenemapperFile") or die "Unable to open the file $GenemapperFile;$!";
$tableheader=readline(FILE); # get rid of headers
while (<FILE>)
{
$size = length $_; # the current line is in $_ ($line was never set)
#print "The size of each line is $size\n"; # output size of line
@x=split(/\t/);
push @SampleFile, $x[0];
push @SampleName, $x[1];
push @SampleID, $x[2];
push @RunName, $x[3];
push @Panel, $x[4];
push @Marker, $x[5];
push @Dye, $x[6];
push @SNP, $x[7];
push @Allele1, $x[8];
push @Allele2, $x[9];
push @Size1, $x[10];
push @Size2, $x[11];
push @Height1, $x[12];
push @Height2, $x[13];
push @PeakArea1, $x[14];
push @PeakArea2, $x[15];
push @DataPoint1, $x[16];
push @DataPoint2, $x[17];
push @Mutation1, $x[18];
push @Mutation2, $x[19];
push @AEComment1, $x[20];
push @AEComment2, $x[21];
push @ADO, $x[22];
push @AE, $x[23];
push @OMIT, $x[24];
push @OS, $x[25];
push @SHP, $x[26];
push @OBA, $x[27];
push @SPA, $x[28];
push @SP, $x[29];
push @BIN, $x[30];
push @PHR, $x[31];
push @LPH, $x[32];
push @SPU, $x[33];
push @AN, $x[34];
push @BD, $x[35];
push @DP, $x[36];
push @NB, $x[37];
push @CC, $x[38];
push @OVL, $x[39];
push @XTLK, $x[40];
push @GQ, $x[41];
push @UD1, $x[42];
push @UD2, $x[43];
push @UD3, $x[44];
push @CV, $x[45];

}
close FILE;


#@Matrix=(\@SampleFile,\@SampleName,\@SampleID,\@RunName,\@Panel, \@Marker,\@Dye,\@SNP,\@Allele1,\@Allele2,\@Size1,\@Size2,\@Height1, \@Height2,\@PeakArea1,\@PeakArea2,\@DataPoint1,\@DataPoint2, \@Mutation1,\@Mutation2,\@AEComment1,\@AEComment2,\@ADO, \@AE,\@OMIT,\@OS,\@SHP,\@OBA,\@SPA,\@SP,\@BIN,\@PHR,\@LPH, \@SPU,\@AN,\@BD,\@DP,\@NB,\@CC,\@OVL,\@XTLK,\@GQ,\@UD1, \@UD2,\@UD3,\@CV);
#print "test4 @Matrix\n";


#$MatrixRef=\@Matrix;
#print $MatrixRef;

$OutputFile1="c:/GQfile.txt"; # forward slash: "\G" inside double quotes is not a path separator
# open the output once, before the loop; opening with ">" inside the loop overwrites the file on every pass
open (OUT, ">$OutputFile1") or die "Unable to open $OutputFile1:$!";
for ($i=0;$i<@GQ;++$i)
{
if ($GQ[$i]!=1)
{
$Allele1[$i]="";
$Allele2[$i]="";

}
#Test the GQ not =1 the genotype should be nil
print OUT "$UD1[$i]\t"."$Allele1[$i]\t"."$Allele2[$i]\n";

# create a array to have sample name and marker as well
$UD1marker=$UD1[$i]."-".$Marker[$i];
push @UD1Marker, $UD1marker ;
# create a hash: key: sample name and marker ; value: allele1 and allele2
#$UD1MarkerAllele1{$UD1Marker[$i]}=$Allele1[$i];
#$UD1MarkerAllele2{$UD1Marker[$i]}=$Allele2[$i];


}
close OUT;

-jlow-
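
One possible way to do the duplicate check with a hash, sketched under the assumption that rows sharing the same UD1 and Marker are replicates of one another. The sample rows reuse the values listed in the post above; the function name check_duplicates and the "UD1-Marker" key format are illustrative choices, not something GeneMapper prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper: group rows by "UD1-Marker" and report, for each
# group, whether every replicate carries the same Allele1/Allele2 pair.
sub check_duplicates {
    my @rows = @_;
    my (%genotypes, @order);
    for my $row (@rows) {
        my $key = "$row->{UD1}-$row->{Marker}";
        push @order, $key unless exists $genotypes{$key};
        # Count how often each genotype occurs within the group
        $genotypes{$key}{"$row->{Allele1}/$row->{Allele2}"}++;
    }
    # A single distinct genotype means all replicates agree
    return map {
        [ $_, keys(%{ $genotypes{$_} }) == 1 ? 'same' : 'different' ]
    } @order;
}

# Rows as listed in the post (only the columns needed here)
my @rows = map {
    my %r;
    @r{qw(Marker Allele1 Allele2 UD1 UD2)} = @$_;
    \%r;
} (
    [qw(BM4045.3FR 112 112 C3 Dup1)],
    [qw(BM4045.3FR 112 112 C3 Dup2)],
    [qw(BM4045.3FR 112 112 C3 Dup3)],
    [qw(BMS527.3FR 177 177 C3 Dup1)],
    [qw(BMS527.3FR 177 177 C3 Dup2)],
    [qw(BMS527.3FR 177 177 C3 Dup3)],
);

printf "%s\t%s\n", @$_ for check_duplicates(@rows);
```

Keying on the sample and marker together is what makes the hash workable here: neither column is unique on its own, but the pair identifies one replicate group.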

I spent one hour trying to understand your example, in vain.

QUOTE
· CheckDuplicateResult.txt is to check whether rows with the same sample (UD1) and marker name have the same Allele1 and Allele2. (I think I should use a hash but cannot work out how.)
Marker Allele 1 Allele 2 UD1 UD2 UD3 CV
BM4045.3FR 112 112 C3 Dup1 NA
BM4045.3FR 112 112 C3 Dup2 NA
BM4045.3FR 112 112 C3 Dup3 NA
BMS527.3FR 177 177 C3 Dup1 NA
BMS527.3FR 177 177 C3 Dup2 NA
BMS527.3FR 177 177 C3 Dup3 NA

CheckDuplicateResult.txt should give the output as
Marker Allele 1 Allele 2 UD1 UD2 UD3 NewCol
BM4045.3FR 112 112 C3 Dup2 & Dup3 NA Dup1 is different: Allele2 is 112
BMS527.3FR 177 177 C3 Dup1&Dup2 & Dup3 NA


For BM4045.3FR, Dup1-3 all have the same allele "112", so why does your sample output say "Dup1 is different: Allele2 is 112"?

For BMS527.3FR, I can understand that Dup1, Dup2 & Dup3 are the same.

To me, the two markers should have the same output.

-sage-
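
To make the expected output rule concrete, here is a rough sketch of one way a summary line per sample and marker could be built. It assumes that the most frequent genotype in a group is reported, that agreeing UD2 labels are joined with " & ", and that any other replicate is noted in the last column; none of these rules is confirmed in the thread. The Dup1 genotype 112/114 below is invented purely to show a discordant case, and the function name summarise is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse replicate rows (same UD1 and Marker)
# into one tab-delimited summary line per group.
sub summarise {
    my @rows = @_;
    my (%group, @order);
    for my $row (@rows) {
        my $key = "$row->{UD1}\t$row->{Marker}";
        push @order, $key unless exists $group{$key};
        # genotype => list of UD2 labels that carry it
        push @{ $group{$key}{"$row->{Allele1}\t$row->{Allele2}"} }, $row->{UD2};
    }
    my @out;
    for my $key (@order) {
        my $calls = $group{$key};
        # Most frequent genotype first; ties broken alphabetically
        # (an assumption, the thread does not define a tie rule)
        my ($major, @minor) = sort {
            @{ $calls->{$b} } <=> @{ $calls->{$a} } || $a cmp $b
        } keys %$calls;
        my ($ud1, $marker) = split /\t/, $key;
        my ($a1, $a2) = split /\t/, $major, -1;
        my $note = join '; ', map {
            my ($m1, $m2) = split /\t/, $_, -1;
            join(' & ', @{ $calls->{$_} }) . " is different: $m1/$m2"
        } @minor;
        push @out, join "\t", $marker, $a1, $a2, $ud1,
            join(' & ', @{ $calls->{$major} }), $note;
    }
    return @out;
}

# Sample rows; the 112/114 call for Dup1 is made up for illustration
my @rows = map {
    my %r;
    @r{qw(Marker Allele1 Allele2 UD1 UD2)} = @$_;
    \%r;
} (
    [qw(BM4045.3FR 112 114 C3 Dup1)],
    [qw(BM4045.3FR 112 112 C3 Dup2)],
    [qw(BM4045.3FR 112 112 C3 Dup3)],
    [qw(BMS527.3FR 177 177 C3 Dup1)],
    [qw(BMS527.3FR 177 177 C3 Dup2)],
    [qw(BMS527.3FR 177 177 C3 Dup3)],
);

print "$_\n" for summarise(@rows);
```

With the data exactly as posted, where all three replicates of both markers agree, both groups would come out in the same form with an empty last column.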