Local BLAST returns too many HSPs (Dec/06/2010)

Hello. I am an undergrad in a computational biology lab. This is my first attempt at running BLAST locally. I am using the older (legacy) BLAST executables because I was getting registry errors with the newer ones. I have been struggling with this issue for a while now. Here is the gist of what I have done:

1) Used "formatdb.exe" to convert 25 whole-chromosome sequences (25 ".fa" files) into one large genomic database (a ~760 MB ".nsq" file plus the accompanying database files)

The command I used:

formatdb -i "ref_chr1.fa ref_chr2A.fa ref_chr2B.fa (etc...)" -t legacypantrog -p F -n legacypantrog.db
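The same call can also be assembled from a script, which saves typing all 25 file names by hand. This is only a minimal sketch in Python: `build_formatdb_args` is a hypothetical helper name, and it uses nothing beyond the flags in the command above (-i, -t, -p F, -n).

```python
import subprocess


def build_formatdb_args(fasta_files, name, title=None):
    """Return the argv list for a legacy formatdb nucleotide build.

    Only the flags from the command above are used: -i, -t, -p F, -n.
    """
    if not fasta_files:
        raise ValueError("need at least one FASTA file")
    # Several input files are passed as ONE space-separated -i value,
    # so file names that themselves contain spaces will not work here.
    return [
        "formatdb",
        "-i", " ".join(fasta_files),
        "-t", title or name,
        "-p", "F",  # F = nucleotide input
        "-n", name,
    ]


if __name__ == "__main__":
    args = build_formatdb_args(
        ["ref_chr1.fa", "ref_chr2A.fa", "ref_chr2B.fa"],
        "legacypantrog.db",
        title="legacypantrog",
    )
    print(" ".join(args))
    # To actually run it (assumes formatdb is on PATH):
    # subprocess.run(args, check=True)
```

Passing the arguments as a list means the multi-file -i value reaches formatdb as a single argument, which is what the quotes do on the command line.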


2) Used "blastall.exe" to search for a ~1000 bp sequence in FASTA format against the previously made database

The command I used:

blastall -p blastn -d legacypantrog.db -i testfile.fasta -o testblast.xml -m 7 -v 1 -b 1 -K 1 -e 1e-30

An explanation of the blastall parameters is available here: http://www.plexdb.org/modules/documentation/NCBIblastall.htm
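Because -m 7 writes XML, the report can be summarised with a few lines of Python before reading through it by eye. The sketch below counts HSPs per hit. It assumes the standard NCBI BlastOutput element names (`Hit`, `Hit_def`, `Hit_hsps`, `Hsp`), and `hsps_per_hit` is a hypothetical helper name.

```python
import sys
import xml.etree.ElementTree as ET
from collections import Counter


def hsps_per_hit(xml_source):
    """Count HSPs for each hit in a BLAST XML (-m 7) report.

    xml_source may be a file name or an open file object.
    Hits that share the same Hit_def are counted together.
    """
    counts = Counter()
    tree = ET.parse(xml_source)
    for hit in tree.iter("Hit"):
        label = hit.findtext("Hit_def", default="(no Hit_def)")
        counts[label] += len(hit.findall("Hit_hsps/Hsp"))
    return counts


if __name__ == "__main__":
    # Usage: python count_hsps.py testblast.xml
    if len(sys.argv) > 1:
        for name, n in hsps_per_hit(sys.argv[1]).most_common():
            print(n, name)
```

This reads the whole report into memory, which should be fine for a single-query report but may need a streaming parser for much larger files.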

3) The problem:

The output file returns ONE hit, but with ~3000 HSPs! The first alignment in the output is exactly what I want: an alignment of the whole query sequence against a sequence in the database. The rest of the output looks like 3000 of these, varying in length:

Score = 141 bits (71), Expect = 9e-031
Identities = 89/95 (93%)
Strand = Plus / Plus


Query: 273       tctactaaaactacaaaaattagctgggcacggtggcaggcgcctgtaatcccagctact 332
                 |||||||||| ||||||||| ||||||||| ||||||||||||||||| |||||||||||
Sbjct: 193250233 tctactaaaaatacaaaaatgagctgggcatggtggcaggcgcctgtagtcccagctact 193250292


Query: 333       caggaggctgaggcaggagaatcacttgaacctgg 367
                 | |||||| ||||||||||||||||||||||||||
Sbjct: 193250293 cgggaggcggaggcaggagaatcacttgaacctgg 193250327


Is there a way to get rid of all these extra HSPs in my output?

The following summary is from an HTML-format output; maybe it can help:

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Sequences: 25
Number of Hits to DB: 43,805,126
Number of extensions: 1120422
Number of successful extensions: 1120422
Number of sequences better than 1.0e-030: 25
Number of HSP's gapped: 1091227
Number of HSP's successfully gapped: 75943
Length of query: 971
Length of database: 3,175,582,169
Length adjustment: 21
Effective length of query: 950
Effective length of database: 3,175,581,644
Effective search space: 3016802561800
Effective search space used: 3016802561800
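As a sanity check, the effective lengths and search space in this summary can be reproduced from the other values, and the standard bit-score relation E = m * n * 2^(-S') gives a rough E-value for the short HSP shown earlier. A minimal sketch in Python; the variable names are my own, and the E-value is only approximate because the reported bit score (141) is rounded.

```python
import math

# Values copied from the summary above.
query_len = 971
db_len = 3_175_582_169
num_seqs = 25
length_adj = 21

# The length adjustment is subtracted once from the query
# and once per database sequence.
eff_query = query_len - length_adj
eff_db = db_len - num_seqs * length_adj
search_space = eff_query * eff_db

print(eff_query)     # 950
print(eff_db)        # 3175581644
print(search_space)  # 3016802561800


def expect_from_bits(bits, space):
    """Approximate E-value from a bit score: E = space * 2^(-bits)."""
    return space * 2.0 ** (-bits)


def bits_for_expect(expect, space):
    """Bit score needed to reach a given E-value in this search space."""
    return math.log2(space / expect)


# The 95-base HSP above reports 141 bits.
print(expect_from_bits(141, search_space))
# Bit score needed to pass the -e 1e-30 cutoff.
print(bits_for_expect(1e-30, search_space))
```

The last figure comes out at roughly 141 bits, so the 95-base repeat alignment sits right at the -e 1e-30 threshold. A stricter -e value should drop HSPs of that size, though it would also drop any genuine matches that score below the new cutoff.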


Any help would be greatly appreciated.

-jonmk-

So the BLAST FAQ says this happens when the query contains repeat elements and the database is large. One of their suggestions is to filter out species-specific repeats.

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#HSPs

Unfortunately it is the repeats that I am interested in. All of my queries have long simple tandem repeats. I don't think there is a simple solution to this issue, but if anyone has any ideas for a workaround, I'd still be interested.
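One possible workaround is to leave the search alone and post-filter the output, keeping only HSPs that span most of the query. The Python sketch below assumes the search is rerun with -m 8 (tab-delimited output, 12 columns, with query start and end in columns 7 and 8) instead of -m 7. `filter_by_query_coverage` and the 90% cutoff are illustrative choices of mine, not something BLAST provides.

```python
import sys


def filter_by_query_coverage(lines, query_len, min_cov=0.9):
    """Yield tabular (-m 8) rows whose HSP spans >= min_cov of the query."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column row
        q_start, q_end = int(fields[6]), int(fields[7])
        covered = abs(q_end - q_start) + 1
        if covered / query_len >= min_cov:
            yield line


if __name__ == "__main__":
    # Usage (hypothetical file name): python filter_hsps.py testblast.tsv 971
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as handle:
            for row in filter_by_query_coverage(handle, int(sys.argv[2])):
                print(row)
```

With a 971-base query, the 95-base HSP shown above (query positions 273 to 367) covers about 10% of the query and would be dropped, while a full-length alignment would be kept. The cutoff would need tuning, since this also discards genuine partial matches.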

-jonmk-