mothur
#1
Posted 23 July 2009 - 10:42 AM
I've got a FASTA file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a continuous column (without sequences). This is so I can then build a "group" file for mothur and run a multi-sample analysis (the different names refer to microbial population samples from six different sites). I can't seem to think of a way to do this in TextWrangler, Word, Excel, etc., so any advice is much appreciated.
In general, if you have recommendations for a program that is good at formatting and concatenating different sequence files (FASTA, NEXUS, PHYLIP, etc.), that would be super-awesome too.
#2
Posted 23 July 2009 - 11:11 AM
Ah, Biopython seems to be the toolkit for these types of issues. Tutorial time.
Edited by 303microbialist, 23 July 2009 - 11:12 AM.
#3
Posted 17 August 2009 - 11:21 PM
Forget Biopython; get a copy of BioEdit. It lets you edit, cut, and paste the sequence names independently of the sequences.
J
#4
Posted 18 August 2009 - 04:56 AM
If you want a column of names, like:
B10988
B10989
B10990
B10991
use:
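#!/usr/bin/perl
use strict;
use warnings;

# Print the name from every FASTA header line, one name per line.
open(my $in, '<', "filename.ext") or die "Couldn't open filename.ext: $!\n";
open(my $out, '>', "fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";
while (<$in>) {
    if (/^>(.+)/) {
        print $out "$1\n";
    }
}
close($in);
close($out) or die "Couldn't close fasta_names.txt: $!\n";

If you want a tab-delimited *row*, like: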
B10988<tab>B10989<tab>B10990<tab>B10991
use:
#!/usr/bin/perl
use strict;
use warnings;

# Collect the names, then print them as a single tab-delimited row
# (joining avoids a stray tab after the last name).
open(my $in, '<', "filename.ext") or die "Couldn't open filename.ext: $!\n";
open(my $out, '>', "fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";
my @names;
while (<$in>) {
    if (/^>(.+)/) {
        push @names, $1;
    }
}
print $out join("\t", @names), "\n";
close($in);
close($out) or die "Couldn't close fasta_names.txt: $!\n";

This assumes that there's nothing on the FASTA description line other than the gene name. If there's something else, like a description of the gene following its name, change the line:
    if (/^>(.+)/) {
to:
    if (/^>(.+?)\s/) {
so it will capture all text following the > up until the first space it encounters.
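For anyone who would rather stay in Python, the same two outputs (a column or a row) can be sketched without any extra libraries. This is only a minimal sketch, not the scripts above translated line for line: the function name fasta_names and the sample records are made up for illustration, and it assumes the name is everything after the > up to the first whitespace.

```python
import io


def fasta_names(handle):
    """Yield the name from each FASTA header line in an open text handle.

    The name is taken to be the text after '>' up to the first whitespace,
    so any description following the name is dropped.
    """
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' that carries no name
                yield fields[0]


# Hypothetical sample data standing in for a real FASTA file.
sample = io.StringIO(
    ">B10988 site one\nACGTACGT\n"
    ">B10989\nGGCCTTAA\n"
    ">B10990 site two\nTTGACA\n"
)
names = list(fasta_names(sample))

column = "\n".join(names) + "\n"  # one name per line
row = "\t".join(names) + "\n"     # all names on one tab-delimited line
print(column, end="")
print(row, end="")
```

To run it on a real file, pass open("filename.ext") instead of the io.StringIO sample and write column or row to your output file.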
#5
Posted 18 August 2009 - 06:27 AM
grep '>' input_filename.txt > output_filename.txt
To make sure you get only the lines starting with >, replace the above with:
grep '^>' input_filename.txt > output_filename.txt
You can then use awk to get at specific columns.
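One thing to watch: grep keeps the whole header line, including the leading >. If it helps, here is a rough Python sketch of the same filter-then-pick-a-column idea. The function name header_fields and the two-field header layout in the example are assumptions made up for illustration, and columns are counted from 0 here, unlike awk's $1.

```python
def header_fields(lines, column=0, sep=None):
    """Return one field from each FASTA header line.

    Keeps only lines that start with '>', strips that marker, splits the
    rest on `sep` (any whitespace by default), and takes the field at
    index `column`. Headers with too few fields are skipped.
    """
    picked = []
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].strip().split(sep)
        if column < len(fields):
            picked.append(fields[column])
    return picked


# Hypothetical header layout: a name followed by a site label.
example = [
    ">B10988 siteA",
    "ACGT",
    ">B10989 siteB",
    "GGCC",
]
first = header_fields(example)             # the names
second = header_fields(example, column=1)  # the site labels
print(first)
print(second)
```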
Edited by perlmunky, 18 August 2009 - 06:28 AM.
#6
Posted 18 August 2009 - 02:14 PM
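perl -nle 'print $1 if /^>(\S+)/' your_fasta_file.ext > gene_names.txt
for a column of gene names, or:
perl -ne 'print "$1\t" if /^>(\S+)/; END { print "\n" }' your_fasta_file.ext > gene_names.txt
for a row of tab-delimited gene names. The single quotes keep a Unix shell from expanding $1 before Perl sees it, and \S+ takes the name up to the first whitespace whether or not a description follows it.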
#7
Posted 20 August 2009 - 12:32 AM
HomeBrew, on Aug 18 2009, 02:14 PM, said:
perl -nle 'print $1 if /^>(\S+)/' your_fasta_file.ext > gene_names.txt
for a column of gene names, or:
perl -ne 'print "$1\t" if /^>(\S+)/; END { print "\n" }' your_fasta_file.ext > gene_names.txt
for a row of tab-delimited gene names.
ok, now I have to look at how to do this with assembly.
#8
Posted 20 August 2009 - 05:00 AM
perlmunky, on Aug 20 2009, 04:32 AM, said:
dosseg
.model small
.stack 100h
.data
hello_message db 'Hello, World!',0dh,0ah,'$'
.code
main proc
    mov ax,@data
    mov ds,ax
    mov ah,9
    mov dx,offset hello_message
    int 21h
    mov ax,4C00h
    int 21h
main endp
end main
No, wait -- that's not right...
#9
Posted 10 September 2009 - 04:39 PM
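I did figure out how to do it easily in Biopython too, if anyone's interested:
>>> from Bio import SeqIO
>>> input_handle = open("my_input.txt")  # placeholder file names
>>> output_handle = open("my_output.txt", "w")
>>> seq_rec_list = [seq_record.id for seq_record in SeqIO.parse(input_handle, "fasta")]
>>> seq_rec_list
>>> seq_rec_string = '\n'.join(seq_rec_list)
>>> output_handle.write(seq_rec_string)
>>> output_handle.close()
input_handle is of course an open handle on your FASTA file.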
I'll have to check BioEdit out though.
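Since the end goal was a mothur group file, here is a rough sketch of that last step. As far as I understand the format, each line is a sequence name, a tab, and the group (site) it belongs to. How a name maps to a site is not something this thread covers, so the first-letter rule, the site labels, and the function names below are all made up for illustration; swap in whatever rule fits your data.

```python
def group_lines(names, site_for):
    """Build group-file lines: sequence name, a tab, then its group."""
    return ["{}\t{}\n".format(name, site_for(name)) for name in names]


# Hypothetical rule: the first letter of the name identifies the site.
SITE_BY_PREFIX = {"B": "site1", "C": "site2"}


def site_from_prefix(name):
    """Look up the site for a name from its first letter (assumed rule)."""
    return SITE_BY_PREFIX[name[0]]


# Names as produced by the list comprehension above; these are examples.
names = ["B10988", "B10989", "C20001"]
lines = group_lines(names, site_from_prefix)
print("".join(lines), end="")
```

Writing the result out is then just a matter of opening an output file and calling writelines(lines) on it.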
Edited by 303microbialist, 10 September 2009 - 04:43 PM.
#10
Posted 11 September 2009 - 12:53 AM
You can make that code more or less compact if you like. (I prefer not to write compact code, as it can be nasty when I come back to it, or when someone else has to use it.)
Short:
This takes your list comprehension and does the writing in the same pass, so you don't have to perform the "\n".join(list).
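#!/usr/bin/python
from Bio import SeqIO

# Write each record ID on its own line as the records are parsed;
# the with block closes both files when it finishes.
with open("my_input.txt") as fasta, open("my_output.txt", "w") as output:
    output.writelines(rec.id + "\n" for rec in SeqIO.parse(fasta, "fasta"))

Longer: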
Typically I wouldn't do this either. I would open the file handle with various checks (size, does it exist, etc.), then loop over the file.
#!/usr/bin/python
from Bio import SeqIO

with open("my_input.txt") as fasta, open("my_output.txt", "w") as output:
    # One record at a time: write its ID followed by a newline.
    for rec in SeqIO.parse(fasta, "fasta"):
        output.write(rec.id + "\n")

I am trying *hard* to avoid my python code at the moment
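The "various checks" mentioned above might look something like the sketch below. To keep it runnable without Biopython it reads the header lines directly instead of calling SeqIO.parse, so treat it as an illustration of the checks rather than a drop-in replacement. The function name write_fasta_ids and the demo records are made up.

```python
import os
import tempfile


def write_fasta_ids(in_path, out_path):
    """Write one FASTA record name per line; return how many were written."""
    # Check that the input exists and is not empty before doing any work.
    if not os.path.isfile(in_path):
        raise FileNotFoundError("no such input file: " + in_path)
    if os.path.getsize(in_path) == 0:
        raise ValueError("input file is empty: " + in_path)

    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # name = text up to the first whitespace
                    dst.write(fields[0] + "\n")
                    count += 1
    return count


# Demo on a throwaway file holding made-up records.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "my_input.txt")
    out = os.path.join(tmp, "my_output.txt")
    with open(fasta, "w") as fh:
        fh.write(">B10988\nACGT\n>B10989 some description\nGGCC\n")
    written = write_fasta_ids(fasta, out)
    with open(out) as fh:
        result = fh.read()

print(written)
print(result, end="")
```

Raising on a missing or empty file means a bad path fails loudly instead of quietly producing an empty names file.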













