Jump to content

  • Log in with Facebook Log in with Twitter Log in with Windows Live Log In with Google      Sign In   
  • Create Account

Submit your paper to J Biol Methods today!
- - - - -

ATCG content calculation

  • Please log in to reply
3 replies to this topic

#1 muchmoa



  • Active Members
  • Pip
  • 6 posts

Posted 06 July 2009 - 05:17 PM

Hello everyone,
Is there a program can calculate the content of a,t,c and g of each gene in E. coli ?
I want to know the frequency of each nucleotide.
Thank you very much.

#2 bioforum



  • Super Administrators
  • PipPipPipPipPipPipPipPipPipPip
  • 165 posts

Posted 06 July 2009 - 06:26 PM

Find the sequence from NCBI nucleotide database and paste it here http://www.protocol-.../dna_stats.html

#3 HomeBrew



  • Global Moderators
  • PipPipPipPipPipPipPipPipPipPip
  • 930 posts

Posted 06 July 2009 - 06:59 PM

Perl can do it fairly easily. Here's a quick-and-dirty script:

#!/usr/bin/perl -w
use strict;

my $input_file = "e.coli.nuc";
my $output_file = "e.coli.freq.txt";

my $count = 0;

open (IN, $input_file) or die "Can't find a file called $input_file! $!\n";
open (OUT, ">$output_file") or die "Can't open $output_file! $!\n";

$/ = ">";

while (<IN>) {
	my ($gene, $description, $seq) = ($_ =~ /(.+?)\s+(.+?)\n([aAtTgGcCnN\n]+)/);
	next unless ($gene);
	$seq =~ s/\n//g;
	my $A = ($seq =~ tr/Aa/Aa/);
	my $G = ($seq =~ tr/Gg/Gg/);
	my $C = ($seq =~ tr/Cc/Cc/);
	my $T = ($seq =~ tr/Tt/Tt/);
	my $N = ($seq =~ tr/Nn/Nn/);
	my $bp = length($seq);
	my $gc = sprintf("%0.1f", 100 * (($G + $C)/$bp));
	print OUT ">$gene $description\n\tA = $A, G = $G, C = $C, T = $T, N = $N, total bp = $bp, %GC = $gc\n";

print "Finished $count genes from $input_file\n";

The script takes as input a Fasta formatted file of DNA sequences, in this case "e.coli.nuc", which is the Escherichia coli K-12 MG1655 DNA ORFs dowloaded from ftp://ftp.genome.jp/pub/kegg/genes/organisms/, in the 'eco' directory.

There are actually 24 E. coli genomes available there, identifiable by their three-letter code:

T00007(1997) eco E.coli Escherichia coli K-12 MG1655
T00068(2001) ecj E.coli_J Escherichia coli K-12 W3110
T00666(2008) ecd E.coli_DH10B Escherichia coli K-12 DH10B
T00913(2009) ebw E.coli_BW2952 Escherichia coli BW2952
T00044(2001) ece E.coli_O157 Escherichia coli O157 EDL933 (EHEC)
T00048(2001) ecs E.coli_O157J Escherichia coli O157 Sakai (EHEC)
T00778(2008) ecf E.coli_O157_EC4115 Escherichia coli O157 EC4115 (EHEC)
T00106(2002) ecc E.coli_CFT073 Escherichia coli CFT073 (UPEC)
T00338(2006) eci E.coli_UTI89 Escherichia coli UTI89 (UPEC)
T00373(2006) ecp E.coli_536 Escherichia coli 536 (UPEC)
T00425(2006) ecv E.coli_APEC Escherichia coli APEC O1
T00590(2007) ecw E.coli_E24377A Escherichia coli E24377A
T00591(2007) ecx E.coli_HS Escherichia coli HS
T00672(2008) ecm E.coli_SECEC Escherichia coli SECEC
T00784(2008) ecy E.coli_SE11 Escherichia coli SE11
T00697(2008) ecl E.coli_ATCC8739 Escherichia coli ATCC 8739
T00796(2008) ecg E.coli_0127 Escherichia coli O127
T00826(2009) eck E.coli_55989 Escherichia coli 55989
T00827(2009) ecq E.coli_ED1a Escherichia coli ED1a
T00828(2009) ecr E.coli_IAI1 Escherichia coli IAI1
T00829(2009) ect E.coli_IAI39 Escherichia coli IAI39
T00830(2009) ecz E.coli_S88 Escherichia coli S88
T00831(2009) eum E.coli_UMN026 Escherichia coli UMN026
T00854(2009) elf E.coli_LF82 Escherichia coli LF82

You can format the print out anyway you like by changing the "print" line. As written, the script outputs the data to a file called "e.coli.freq.txt", but you're free to change the output or input files as you like.

On my machine, it takes less than a second to give you the base frequences, total base count, and %GC for all 4.493 genes from this E. coli genome.

#4 muchmoa



  • Active Members
  • Pip
  • 6 posts

Posted 07 July 2009 - 02:11 AM

Oh, that's great!
It really helps.
Thank you, HomeBrew and Bioforum.

Home - About - Terms of Service - Privacy - Contact Us

©1999-2013 Protocol Online, All rights reserved.