
mothur - i'm stupid (Jul/23/2009 )

Hey all,
I've got a fasta file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a single column (without the sequences). This is so I can then build a "group" file for mothur and run a multi-sample analysis (the different names refer to microbial population samples from six different sites). I can't think of a way to do this in TextWrangler, Word, Excel, etc., so any advice is much appreciated.

In general, if you have recommendations for a program that's good at formatting and concatenating different sequence files (fasta, nexus, phylip, etc.), that would be super-awesome too.

-303microbialist-

303microbialist on Jul 23 2009, 12:42 PM said:

I've got a fasta file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a single column (without the sequences)...



Ah, Biopython seems to be the toolkit for this type of issue. Tutorial time.
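
For reference, a minimal sketch of what that Bio.SeqIO route can look like (the file names here are just placeholders):

from Bio import SeqIO

# sketch only: pull just the record names out of a fasta file
in_handle = open("my_seqs.fasta")
out_handle = open("names.txt", "w")
for record in SeqIO.parse(in_handle, "fasta"):
    # record.id is the header text after ">" up to the first whitespace
    out_handle.write(record.id + "\n")
out_handle.close()
in_handle.close()

record.id is exactly the B10988-style name (without the >), so the output is the one-name-per-line column described above.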

-303microbialist-

303microbialist on Jul 23 2009, 11:42 AM said:

I've got a fasta file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a single column (without the sequences)...


Forget Biopython; get a copy of BioEdit. It allows you to edit/cut/paste the sequence names independently of the sequences.

J

-Jugsy-

Perl can do this, but I'm not sure what you mean by a "tab-delimited column". If you want a column of FASTA names like:

B10988
B10989
B10990
B10991

use:

#!/usr/bin/perl -w
use strict;

open (IN, "filename.ext") or die "Couldn't open filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

while (<IN>) {
    # header lines start with ">"; capture everything after it
    if (/^>(.+)/) {
        print OUT "$1\n";    # one name per line
    }
}

close IN;
close OUT;


If you want a tab-delimited *row*, like:

B10988<tab>B10989<tab>B10990<tab>B10991

use:

#!/usr/bin/perl -w
use strict;

open (IN, "filename.ext") or die "Couldn't open filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

while (<IN>) {
    if (/^>(.+)/) {
        print OUT "$1\t";    # names separated by tabs, all on one line
    }
}

print OUT "\n";    # finish the single row with a newline

close IN;
close OUT;


This assumes that there's nothing on the FASTA description line other than the gene name. If there's something else, like a description of the gene following its name, change the line:

if (/^>(.+)/) {

to:

if (/^>(.+?)\s/) {

so it captures only the text between the > and the first whitespace character. For example, with a header like >B10988 some description here, the first pattern captures the whole line after the >, while the second captures just B10988.

-HomeBrew-

On a *nix machine:

grep '>' input_filename.txt > output_filename.txt

To make sure you only get lines that start with >, you can replace the above with:

grep '^>' input_filename.txt > output_filename.txt

You can then use awk to pull out specific columns.

:)

-perlmunky-

Ahh - the difficult to read but incredibly useful world of Perl one-liners (these are written for Windows):

perl -nle "print for /^>(.+?)\s/g" your_fasta_file.ext > gene_names.txt

for a column of gene names, or:

perl -ne "while(/^>(.+?)\s/g){print \"$1\t\"}" your_fasta_file.ext > gene_names.txt

for a row of tab-delimited gene names.

:lol:

-HomeBrew-

HomeBrew on Aug 18 2009, 02:14 PM said:

Ahh - the difficult to read but incredibly useful world of Perl one-liners (these are written for Windows)...


ok, now I have to look at how to do this with assembly.

-perlmunky-

perlmunky on Aug 20 2009, 04:32 AM said:

ok, now I have to look at how to do this with assembly.


dosseg
.model small
.stack 100h

.data
hello_message db 'Hello, World!',0dh,0ah,'$'

.code
main proc
    mov ax,@data
    mov ds,ax

    mov ah,9
    mov dx,offset hello_message
    int 21h

    mov ax,4C00h
    int 21h
main endp
end main


No, wait -- that's not right...:lol:

-HomeBrew-

Thanks everyone!

I did figure out how to do it easily in BioPython too if anyone's interested:

>>> from Bio import SeqIO
>>> seq_rec_list = [seq_rec.id for seq_rec in SeqIO.parse(open(input_handle), "fasta")]
>>> seq_rec_string = '\n'.join(seq_rec_list)
>>> output_handle.write(seq_rec_string)

input_handle is of course the path to your fasta file, and output_handle a writable file you've opened beforehand.
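
For the group file itself, mothur wants a plain two-column, tab-delimited mapping of sequence name to group, one sequence per line. A minimal sketch of building it from a names file, assuming (hypothetically) that the site can be read off a prefix of each name, say the first character of B10988:

# sketch only: turn a file of sequence names into a mothur group file
# "fasta_names.txt" (one name per line, as produced above) and "sites.groups" are placeholder names
names = open("fasta_names.txt")
groups = open("sites.groups", "w")
for line in names:
    name = line.strip()
    if not name:
        continue
    group = name[0]    # hypothetical rule: swap in your real name-to-site mapping
    groups.write(name + "\t" + group + "\n")
groups.close()
names.close()

Every sequence name in the fasta needs to appear in that file with its group label; how the 1100 names map onto the six sites is up to you.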

I'll have to check BioEdit out though.

-303microbialist-

List comprehensions, eh?

You can make that code more or less compact as you like (I prefer not to use very compact code, as it can be nasty when I come back to it, or when someone else has to use it).

Short:
This takes your list comprehension and writes each name as it goes, so you don't have to do the "\n".join(list) step:
#!/usr/bin/python
from Bio import SeqIO

output = open("my_output.txt", "w")
# write each record's ID, one per line, as the file is parsed
[output.write(rec.id + "\n") for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta")]
output.close()


Longer:
Typically I wouldn't do this either. I would open the file handle with various checks (size, does it exist, etc.), then loop over the file; a sketch of that version follows below this block.
#!/usr/bin/python
from Bio import SeqIO

output = open("my_output.txt", "w")
for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta"):
    output.write(rec.id + "\n")    # one sequence name per line
output.close()
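
A sketch of that check-first version (file names are placeholders, and the checks shown are just examples of the kind meant above):

#!/usr/bin/python
# sketch: check the input file before parsing, then loop over it
import os
import sys

from Bio import SeqIO

infile = "my_input.txt"

if not os.path.exists(infile):
    sys.exit("%s does not exist" % infile)
if os.path.getsize(infile) == 0:
    sys.exit("%s is empty" % infile)

output = open("my_output.txt", "w")
for rec in SeqIO.parse(open(infile), "fasta"):
    output.write(rec.id + "\n")
output.close()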


<I am trying *hard* to avoid my python code at the moment :lol: >

-perlmunky-