mothur


9 replies to this topic

#1 303microbialist

    member

  • Active Members
  • 17 posts

Posted 23 July 2009 - 10:42 AM

Hey all,
I've got a FASTA file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a continuous column (without the sequences). This is so I can then build a "group" file for mothur and run a multi-sample analysis (the different names refer to microbial population samples from six different sites). I can't think of a way to do this in TextWrangler, Word, Excel, etc., so any advice is much appreciated.

In general, if you have recommendations for a program that is good at formatting and concatenating different sequence files (FASTA, NEXUS, PHYLIP, etc.), that would be super-awesome too.

#2 303microbialist

    member

  • Active Members
  • 17 posts

Posted 23 July 2009 - 11:11 AM

Ah, Biopython seems to be the toolkit for these types of issues. Tutorial time.

Edited by 303microbialist, 23 July 2009 - 11:12 AM.


#3 Jugsy

    member

  • Members
  • 1 posts

Posted 17 August 2009 - 11:21 PM


Forget Biopython; get a copy of BioEdit. It lets you edit, cut, and paste the sequence names independently of the sequences.

J

#4 HomeBrew

    Veteran

  • Moderator
  • 950 posts

Posted 18 August 2009 - 04:56 AM

Perl can do this, but I'm not sure what you mean by a "tab-delimited column". If you want a column of FASTA names like:

B10988
B10989
B10990
B10991

use:

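#!/usr/bin/perl
use strict;
use warnings;

# Read the FASTA file and write one sequence name per line.
open(my $in, '<', "filename.ext") or die "Couldn't open filename.ext: $!\n";
open(my $out, '>', "fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

while (<$in>) {
   # Header lines start with '>'; capture everything after it.
   if (/^>(.+)/) {
      print {$out} "$1\n";
   }
}

close($in);
close($out);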


If you want a tab-delimited *row*, like:

B10988<tab>B10989<tab>B10990<tab>B10991

use:

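#!/usr/bin/perl
use strict;
use warnings;

open(my $in, '<', "filename.ext") or die "Couldn't open filename.ext: $!\n";
open(my $out, '>', "fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

# Collect the names first so the row does not end with a stray tab.
my @names;
while (<$in>) {
   if (/^>(.+)/) {
      push @names, $1;
   }
}

print {$out} join("\t", @names), "\n";

close($in);
close($out);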


This assumes that there's nothing on the FASTA description line other than the gene name. If there's something else, like a description of the gene following its name, change the line:

   if (/^>(.+)/) {


to:

   if (/^>(.+?)\s/) {


so it will capture all text following the > up until the first whitespace character it encounters (a space, a tab, or the newline at the end of the line if there is no description).
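For anyone who would rather stay in Python, here is a minimal sketch of the same idea without Biopython. The pattern `^>(\S+)` is my own variation on the regex above (it takes the name up to the first whitespace), and the sample lines are made up for illustration, not taken from the original poster's file.

```python
import re

# Matches a FASTA header line and captures the name up to the first whitespace.
HEADER = re.compile(r"^>(\S+)")


def fasta_names(lines):
    """Return the sequence names found on the header lines of `lines`."""
    names = []
    for line in lines:
        match = HEADER.match(line)
        if match:
            names.append(match.group(1))
    return names


# Hypothetical sample input: one header with a description, one without.
sample = [
    ">B10988 16S rRNA gene, site 1\n",
    "ACGTACGTAC\n",
    ">B10989\n",
    "TTGACCGTAA\n",
]

names = fasta_names(sample)
column = "\n".join(names) + "\n"  # one name per line
row = "\t".join(names) + "\n"     # a single tab-delimited row, no trailing tab

print(column, end="")
print(row, end="")
```

Passing an open file object instead of `sample` should work the same way, since the function only iterates over lines.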

#5 perlmunky

    The Evil

  • Active Members
  • 68 posts

Posted 18 August 2009 - 06:27 AM

On a *nix machine:

grep '>' input_filename.txt > output_filename.txt

To make sure you match only lines that start with >, replace the above with:
grep '^>' input_filename.txt > output_filename.txt

The output lines keep their leading >. You can then use awk to get at specific columns.

:)

Edited by perlmunky, 18 August 2009 - 06:28 AM.


#6 HomeBrew

    Veteran

  • Moderator
  • 950 posts

Posted 18 August 2009 - 02:14 PM

Ahh - the difficult to read but incredibly useful world of Perl one-liners (these are written for Windows):

perl -nle "print for /^>(\S+)/g" your_fasta_file.ext > gene_names.txt


for a column of gene names, or:

perl -ne "while(/^>(.+?)\s/g){print \"$1\t\"}" your_fasta_file.ext > gene_names.txt


for a row of tab-delimited gene names.

:lol:

#7 perlmunky

    The Evil

  • Active Members
  • 68 posts

Posted 20 August 2009 - 12:32 AM


ok, now I have to look at how to do this with assembly.

#8 HomeBrew

    Veteran

  • Moderator
  • 950 posts

Posted 20 August 2009 - 05:00 AM

perlmunky, on Aug 20 2009, 04:32 AM, said:

ok, now I have to look at how to do this with assembly.


dosseg
.model small
.stack 100h

.data
hello_message db 'Hello, World!',0dh,0ah,'$'

.code
main  proc
	  mov	ax,@data
	  mov	ds,ax

	  mov	ah,9
	  mov	dx,offset hello_message
	  int	21h

	  mov	ax,4C00h
	  int	21h
main  endp
end   main


No, wait -- that's not right...:lol:

#9 303microbialist

    member

  • Active Members
  • 17 posts

Posted 10 September 2009 - 04:39 PM

Thanks everyone!

I did figure out how to do it easily in BioPython too if anyone's interested:

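>>> from Bio import SeqIO

>>> input_handle = open("my_input.fasta")  # placeholder name for your FASTA file

>>> output_handle = open("fasta_names.txt", "w")  # placeholder name for the output

>>> seq_rec_list = [seq_record.id for seq_record in SeqIO.parse(input_handle, "fasta")]

>>> seq_rec_list

>>> seq_rec_string = '\n'.join(seq_rec_list)

>>> output_handle.write(seq_rec_string + '\n')

>>> output_handle.close()

input_handle is a handle opened on your FASTA file, and output_handle is a file opened for writing.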

I'll have to check BioEdit out though.
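Since the end goal was a mothur "group" file, here is a rough sketch of that last step. As far as I know the group file is just two tab-separated columns, sequence name then group name, but check the mothur documentation for your version. The `site_by_name` lookup and the site labels are hypothetical; how you map each name to one of the six sites depends on your own naming scheme.

```python
def build_group_text(names, site_by_name):
    """Return 'name<TAB>group' lines, one per sequence name."""
    lines = []
    for name in names:
        # A KeyError here means a name has no site assigned yet.
        lines.append(name + "\t" + site_by_name[name] + "\n")
    return "".join(lines)


# Hypothetical names and site labels, for illustration only.
names = ["B10988", "B10989", "B10990"]
site_by_name = {"B10988": "siteA", "B10989": "siteA", "B10990": "siteB"}

group_text = build_group_text(names, site_by_name)
print(group_text, end="")
```

Writing `group_text` to a file is then a single `open(..., "w")` and `write` call.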

Edited by 303microbialist, 10 September 2009 - 04:43 PM.


#10 perlmunky

    The Evil

  • Active Members
  • 68 posts

Posted 11 September 2009 - 12:53 AM

List comprehensions, eh?

You can make that code more or less compact if you like (I prefer not to write compact code, as it can be nasty when I come back to it or when someone else has to use it).

Short:
This takes your list comprehension and does everything in one pass, so you don't have to perform the "\n".join(list).
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
[output.write(rec.id + "\n") for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta")]
output.close()


Longer:
Typically I wouldn't do this either. I would open the file handle with various checks (size, does it exist etc), then loop over the file.
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta"):
    output.write(rec.id + "\n")
output.close()
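The "various checks" idea could look something like the sketch below. To keep it self-contained it reads the header lines with plain Python rather than Biopython, so it is a stand-in for the loop above, not the same code. The function name `write_fasta_names` and the file names are made up for the example.

```python
import os
import tempfile


def write_fasta_names(fasta_path, out_path):
    """Write the name from each FASTA header to out_path; return the count."""
    # Check the input before doing any work.
    if not os.path.isfile(fasta_path):
        raise FileNotFoundError("No such FASTA file: " + fasta_path)
    if os.path.getsize(fasta_path) == 0:
        raise ValueError("FASTA file is empty: " + fasta_path)

    count = 0
    # The with-block closes both files even if something goes wrong.
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip a bare '>' with no name
                    out.write(fields[0] + "\n")
                    count += 1
    return count


# Small demonstration on a throwaway file with made-up sequences.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "my_input.txt")
    out_path = os.path.join(tmp, "my_output.txt")
    with open(fasta_path, "w") as handle:
        handle.write(">B10988 site 1\nACGT\n>B10989\nTTGA\n")
    written = write_fasta_names(fasta_path, out_path)
    with open(out_path) as handle:
        result = handle.read()

print(written)
print(result, end="")
```

Raising an exception on a missing or empty file is one choice among several; printing a warning and returning 0 would suit a batch script just as well.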


<I am trying *hard* to avoid my python code at the moment :lol: >






©1999-2011 Protocol Online, All rights reserved.