Protocol Online logo
Top : New Forum Archives (2009-): : Bioinformatics and Biostatistics

ATCG content calculation - (Jul/06/2009 )

Hello everyone,
Is there a program can calculate the content of a,t,c and g of each gene in E. coli ?
I want to know the frequency of each nucleotide.
Thank you very much.


Find the sequence from NCBI nucleotide database and paste it here


Perl can do it fairly easily. Here's a quick-and-dirty script:

#!/usr/bin/perl -w
use strict;

my $input_file = "e.coli.nuc";
my $output_file = "e.coli.freq.txt";

my $count = 0;

open (IN, $input_file) or die "Can't find a file called $input_file! $!\n";
open (OUT, ">$output_file") or die "Can't open $output_file! $!\n";

$/ = ">";

while (<IN>) {
my ($gene, $description, $seq) = ($_ =~ /(.+?)\s+(.+?)\n([aAtTgGcCnN\n]+)/);
next unless ($gene);
$seq =~ s/\n//g;
my $A = ($seq =~ tr/Aa/Aa/);
my $G = ($seq =~ tr/Gg/Gg/);
my $C = ($seq =~ tr/Cc/Cc/);
my $T = ($seq =~ tr/Tt/Tt/);
my $N = ($seq =~ tr/Nn/Nn/);
my $bp = length($seq);
my $gc = sprintf("%0.1f", 100 * (($G + $C)/$bp));
print OUT ">$gene $description\n\tA = $A, G = $G, C = $C, T = $T, N = $N, total bp = $bp, %GC = $gc\n";

print "Finished $count genes from $input_file\n";

The script takes as input a Fasta formatted file of DNA sequences, in this case "e.coli.nuc", which is the Escherichia coli K-12 MG1655 DNA ORFs dowloaded from, in the 'eco' directory.

There are actually 24 E. coli genomes available there, identifiable by their three-letter code:

T00007(1997) eco E.coli Escherichia coli K-12 MG1655
T00068(2001) ecj E.coli_J Escherichia coli K-12 W3110
T00666(2008) ecd E.coli_DH10B Escherichia coli K-12 DH10B
T00913(2009) ebw E.coli_BW2952 Escherichia coli BW2952
T00044(2001) ece E.coli_O157 Escherichia coli O157 EDL933 (EHEC)
T00048(2001) ecs E.coli_O157J Escherichia coli O157 Sakai (EHEC)
T00778(2008) ecf E.coli_O157_EC4115 Escherichia coli O157 EC4115 (EHEC)
T00106(2002) ecc E.coli_CFT073 Escherichia coli CFT073 (UPEC)
T00338(2006) eci E.coli_UTI89 Escherichia coli UTI89 (UPEC)
T00373(2006) ecp E.coli_536 Escherichia coli 536 (UPEC)
T00425(2006) ecv E.coli_APEC Escherichia coli APEC O1
T00590(2007) ecw E.coli_E24377A Escherichia coli E24377A
T00591(2007) ecx E.coli_HS Escherichia coli HS
T00672(2008) ecm E.coli_SECEC Escherichia coli SECEC
T00784(2008) ecy E.coli_SE11 Escherichia coli SE11
T00697(2008) ecl E.coli_ATCC8739 Escherichia coli ATCC 8739
T00796(2008) ecg E.coli_0127 Escherichia coli O127
T00826(2009) eck E.coli_55989 Escherichia coli 55989
T00827(2009) ecq E.coli_ED1a Escherichia coli ED1a
T00828(2009) ecr E.coli_IAI1 Escherichia coli IAI1
T00829(2009) ect E.coli_IAI39 Escherichia coli IAI39
T00830(2009) ecz E.coli_S88 Escherichia coli S88
T00831(2009) eum E.coli_UMN026 Escherichia coli UMN026
T00854(2009) elf E.coli_LF82 Escherichia coli LF82

You can format the print out anyway you like by changing the "print" line. As written, the script outputs the data to a file called "e.coli.freq.txt", but you're free to change the output or input files as you like.

On my machine, it takes less than a second to give you the base frequences, total base count, and %GC for all 4.493 genes from this E. coli genome.


Oh, that's great!
It really helps.
Thank you, HomeBrew and Bioforum.