
mothur


9 replies to this topic

#1 303microbialist

303microbialist

    member

  • Active Members
  • Pip
  • 17 posts
0
Neutral

Posted 23 July 2009 - 10:42 AM

hey all,
i've got a fasta file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a continuous column (without sequences). this is so i can then build a "group" file for mothur and run multi-sample analysis (the different names refer to microbial population samples from six different sites). i can't seem to think of a way to do this in TextWrangler, Word, Excel, etc., so any advice is much appreciated.

in general, if you have recommendations for a program that is good at formatting and concatenating different sequence files (fasta, nexus, phylip, etc.), that would be super-awesome too.

#2 303microbialist

Posted 23 July 2009 - 11:11 AM

Ah, Biopython seems to be the toolkit for these types of issues. Tutorial time.

Edited by 303microbialist, 23 July 2009 - 11:12 AM.


#3 Jugsy

Jugsy

    member

  • Members
  • Pip
  • 1 posts
0
Neutral

Posted 17 August 2009 - 11:21 PM

Forget Biopython; get a copy of BioEdit. It lets you edit, cut, and paste the sequence names independently of the sequences.

J

#4 HomeBrew

HomeBrew

    Veteran

  • Global Moderators
  • PipPipPipPipPipPipPipPipPipPip
  • 930 posts
16
Good

Posted 18 August 2009 - 04:56 AM

Perl can do this, but I'm not sure what you mean by a "tab-delimited column". If you want a column of FASTA names like:

B10988
B10989
B10990
B10991

use:

#!/usr/bin/perl -w
use strict;

open (IN, "filename.ext") or die "Couldn't find filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

while (<IN>) {
   if (/^>(.+)/) {
	  print OUT "$1\n";
   }
}

If you want a tab-delimited *row*, like:

B10988<tab>B10989<tab>B10990<tab>B10991

use:

#!/usr/bin/perl -w
use strict;

open (IN, "filename.ext") or die "Couldn't find filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

while (<IN>) {
   if (/^>(.+)/) {
	  print OUT "$1\t";
   }
}

print OUT "\n";

This assumes that there's nothing on the FASTA description line other than the gene name. If there's something else, like a description of the gene following its name, change the line:

if (/^>(.+)/) {

to:

if (/^>(.+?)\s/) {

so it will capture all text following the > up until the first space it encounters.
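
For anyone without Perl, the same capture can be sketched in Python using only the standard library. This is a rough equivalent rather than a tested replacement: the function name fasta_names and the sample records are made up for illustration. The pattern takes everything after the > up to the first whitespace, so it should behave the same whether or not a description follows the name.

```python
import re

# Hypothetical helper; the name and the sample data are illustrative only.
HEADER = re.compile(r"^>(\S+)")

def fasta_names(lines):
    """Yield the name from each FASTA header line, up to the first whitespace."""
    for line in lines:
        match = HEADER.match(line)
        if match:
            yield match.group(1)

sample = [">B10988 16S rRNA, site 1\n", "ACGTACGT\n", ">B10989\n", "TTGACA\n"]
names = list(fasta_names(sample))
print("\n".join(names))   # one name per line (a column)
print("\t".join(names))   # a single tab-delimited row
```

With a real file, passing an open file object (for example open("filename.ext")) in place of sample should work, since the function only iterates over lines.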

#5 DELETEMYACCOUNTPLEASE

DELETEMYACCOUNTPLEASE

    Y U NOT DELETE MY ACCOUNT?

  • Active Members
  • PipPipPipPipPip
  • 58 posts
1
Neutral

Posted 18 August 2009 - 06:27 AM

on a *nix machine

grep '>' input_filename.txt > output_filename.txt

to make sure you get only those starting with > you can replace the above:
grep '^>' input_filename.txt > output_filename.txt

you can then use awk to get at specific columns

:)

Edited by perlmunky, 18 August 2009 - 06:28 AM.


#6 HomeBrew

Posted 18 August 2009 - 02:14 PM

Ahh - the difficult to read but incredibly useful world of Perl one-liners (these are written for Windows):

perl -nle "print for /^>(\S+)/" your_fasta_file.ext > gene_names.txt

for a column of gene names, or:

perl -ne "while(/^>(.+?)\s/g){print \"$1\t\"}" your_fasta_file.ext > gene_names.txt

for a row of tab-delimited gene names.

:lol:

#7 DELETEMYACCOUNTPLEASE

Posted 20 August 2009 - 12:32 AM

ok, now I have to look at how to do this with assembly.

#8 HomeBrew

Posted 20 August 2009 - 05:00 AM

ok, now I have to look at how to do this with assembly.


dosseg
.model small
.stack 100h

.data
hello_message db 'Hello, World!',0dh,0ah,'$'

.code
main  proc
	  mov	ax,@data
	  mov	ds,ax

	  mov	ah,9
	  mov	dx,offset hello_message
	  int	21h

	  mov	ax,4C00h
	  int	21h
main  endp
end   main

No, wait -- that's not right...:lol:

#9 303microbialist

Posted 10 September 2009 - 04:39 PM

Thanks everyone!

I did figure out how to do it easily in BioPython too if anyone's interested:

>>> from Bio import SeqIO

>>> seq_rec_list = [seq_record.id for seq_record in SeqIO.parse(input_handle, "fasta")]

>>> seq_rec_list

>>> seq_rec_string = '\n'.join(seq_rec_list)

>>> output_handle.write(seq_rec_string)

input_handle is of course the path to your fasta file, and output_handle is a file opened for writing.
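
Since the end goal was a mothur group file, here is a minimal sketch of that last step. As far as I know, a group file is two tab-separated columns, the sequence name and then its group. Everything about how a name maps to a site is an assumption here: the thread never says how the six sites are encoded, so the prefix rule, the site labels, and the function names are all hypothetical.

```python
import csv
import io

def write_group_file(names, group_for, out):
    """Write one 'name<TAB>group' line per sequence name."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for name in names:
        writer.writerow([name, group_for(name)])

# Assumed rule, for illustration only: the first letter of the name marks the site.
site_by_prefix = {"A": "site1", "B": "site2"}

def group_for(name):
    return site_by_prefix[name[0]]

# An in-memory buffer stands in for the real output file.
buffer = io.StringIO()
write_group_file(["A00017", "B10988", "B10989"], group_for, buffer)
print(buffer.getvalue(), end="")
```

In practice the list of names would be seq_rec_list, and out would be a file opened for writing instead of the buffer.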

I'll have to check BioEdit out though.

Edited by 303microbialist, 10 September 2009 - 04:43 PM.


#10 DELETEMYACCOUNTPLEASE

Posted 11 September 2009 - 12:53 AM

List comprehensions, eh?

You can make that code more compact or less if you desire (I prefer not to use compact code as it can be nasty when I come back to it - or someone else has to use it)

Short:
This takes your list comprehension and does it in one sitting so that you don't have to perform the "\n".join(list)
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
[output.write(rec.id + "\n") for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta")]
output.close()  # flush the names to disk

Longer:
Typically I wouldn't do this either. I would open the file handle with various checks (size, does it exist etc), then loop over the file.
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
for rec in SeqIO.parse( open("my_input.txt", 'r'), "fasta" ):
	output.write( rec.id + "\n")
output.close()  # flush the names to disk
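
The "various checks" mentioned above might look something like the sketch below. It is only one guess at what those checks could be (file exists, file is not empty). It also swaps SeqIO for plain string handling so that it runs without Biopython installed. The function name and the demo file names are made up.

```python
import os
import tempfile

def write_fasta_names(fasta_path, out_path):
    """Copy the name from each FASTA header into out_path, one per line."""
    # Checks before reading: the file must exist and must not be empty.
    if not os.path.isfile(fasta_path):
        raise FileNotFoundError(fasta_path)
    if os.path.getsize(fasta_path) == 0:
        raise ValueError("empty input file: " + fasta_path)
    count = 0
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip a bare ">" with no name
                    out.write(fields[0] + "\n")
                    count += 1
    return count

# Small self-contained demo using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "demo.fasta")
    out_path = os.path.join(tmp, "names.txt")
    with open(fasta_path, "w") as handle:
        handle.write(">B10988 sample one\nACGT\n>B10989\nTTGA\n")
    written = write_fasta_names(fasta_path, out_path)
    with open(out_path) as handle:
        result = handle.read()

print(written, repr(result))
```

The with blocks close both files even if something goes wrong partway through, which is the main thing the shorter versions leave to chance.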

<I am trying *hard* to avoid my python code at the moment :lol: >





©1999-2013 Protocol Online, All rights reserved.