Looking for a method to filter out data from related BLAST results - (Jun/15/2012 )

I am a new member on this forum and I like what I have seen so far. I am currently running a miRNA search as part of my undergraduate honors thesis. I am analyzing expressed sequence tags (ESTs) from a fern species and searching them (blastn) against the miRNA database on miRBase (both mature and hairpin sequences). To find the non-coding ESTs, I am also searching (blastx) against the plant database on UniProt. I have completed the initial searches, which have yielded quite a bit of data from only ~5000 ESTs. I now have to filter out all the protein-coding ESTs (as evidenced by the blastx results) from my blastn results, so that I end up with only non-protein-coding ESTs and their respective mature/hairpin alignments.

Does anyone have a streamlined method for comparing two datasets? I realize that I can manually go through each blastn hit and check whether the corresponding blastx result is significant, but that would be tedious, time-consuming and error-prone.

Eager to hear some solutions!


Try a BLAST parser; Galaxy has a nice one. You could also do a phylogenetic analysis from your BLAST results and then compare the trees.



Using Galaxy tools, I can convert the columns of interest with "Convert delimiters to TAB" to remove white space, and then use the "Compare two Datasets" tool (under Join, Subtract and Group) to find common or distinct rows, so that only the non-matching alignments are displayed. That is an okay method.
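For anyone who would rather script the same idea, here is a minimal Python sketch of the two Galaxy steps: split each line on white space (the TAB conversion), then keep only the rows whose key column does not appear in the other dataset. The function names, the column layout (query ID in the first column) and the example rows are all made up for illustration; this is not what Galaxy runs internally.

```python
import re

def to_tab_rows(text):
    """Split each non-empty line on runs of white space (akin to converting delimiters to TAB)."""
    return [re.split(r"\s+", line.strip()) for line in text.splitlines() if line.strip()]

def non_matching_rows(rows, other_rows, key_col=0):
    """Keep the rows whose key column value never appears in other_rows."""
    other_keys = {row[key_col] for row in other_rows}
    return [row for row in rows if row[key_col] not in other_keys]

# Made-up example rows: query ID, subject ID, percent identity.
blastn_text = """EST001  miR_a  100.0
EST002  miR_b  95.2
EST003  miR_c  90.5"""
blastx_text = """EST002  protein_x  91.0"""

kept = non_matching_rows(to_tab_rows(blastn_text), to_tab_rows(blastx_text))
for row in kept:
    print("\t".join(row))
```

With real data you would read the two result files instead of the inline strings, and probably restrict the blastx rows to the significant hits before comparing.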

The method I went with is the Excel LOOKUP formula.

For example

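Copy the query ID column from the blastx results with %identity > 85 into column AA of the blastn results sheet. Then create a new column B and use =LOOKUP(A2,$AA$1:$AA$50000) (write the range without a thousands separator, i.e. 50000 rather than 50,000, or the formula will not parse). Copy the formula into every B row adjacent to an A row, down to the last A row. Hope that made sense. Note that LOOKUP does an approximate match and expects column AA to be sorted in ascending order, so sort it first or some matches can be wrong. Once the formula finishes, copy A and B and paste as values into AB to enable further data sorting.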

This worked better for me than the Galaxy compare tool because everything stays in Excel, with no need to export to delimited .txt files.
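The same lookup can also be sketched in Python, for anyone whose results outgrow a spreadsheet. It collects the blastx query IDs with %identity > 85 (the cutoff used above) and adds a flag column to each blastn row, using an exact match rather than LOOKUP's approximate one. The column positions (query ID first, %identity third), the function names and the example rows are assumptions for illustration, so adjust them to your own export.

```python
def coding_ids(blastx_rows, min_identity=85.0, id_col=0, identity_col=2):
    """Collect query IDs whose blastx hit exceeds the identity cutoff."""
    return {row[id_col] for row in blastx_rows
            if float(row[identity_col]) > min_identity}

def flag_rows(blastn_rows, ids, id_col=0):
    """Append 'coding' or 'noncoding' to each blastn row (the new column B)."""
    return [row + ["coding" if row[id_col] in ids else "noncoding"]
            for row in blastn_rows]

# Made-up example rows: query ID, subject ID, percent identity.
blastn_rows = [
    ["EST001", "miR_a", "100.0"],
    ["EST002", "miR_b", "95.2"],
    ["EST003", "miR_c", "90.5"],
]
blastx_rows = [
    ["EST002", "protein_x", "91.0"],  # above the cutoff, treated as coding
    ["EST003", "protein_y", "60.0"],  # below the cutoff, ignored
]

flagged = flag_rows(blastn_rows, coding_ids(blastx_rows))
noncoding = [row for row in flagged if row[-1] == "noncoding"]
for row in noncoding:
    print("\t".join(row))
```

Keeping the flag as an extra column, instead of dropping rows straight away, mirrors the spreadsheet approach and leaves the data sortable afterwards.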