
NCBI RefSeq FASTA Results - (Jul/02/2007 )

Howdy

I'll begin by declaring that I'm a computer science guy, relatively new to the world of genomics. As much as I love genomics, it does tend to be an a$$kicker without the proper background.

Having said that, I'm working with the mouse genome and am trying to verify that the chromosome FASTA files I've downloaded from NCBI's FTP site match the sequences NCBI displays when I query for specific genes. For example:

Search the Gene database for BC024897 and you find Tap1 on chromosome 17, bp 34324825-34333744 (scroll down the page to Reference Assembly (C57BL/6J) and find the entry for NC_000083.5). When I pull up that FASTA sequence in NCBI and compare it to the sequence I've pulled via Perl script from the latest RefSeq chromosome FASTA file (same GI:149313536), I see two different sequences.

What am I missing? By working with the chromosomes and steering clear of the contigs, I thought I would avoid these kinds of disconnects. I'm stumped and need serious enlightening.

Jason

-WakeDude-
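
For readers following along, the extraction step described in the post above can be sketched roughly as follows. This is a minimal illustration in Python, not the poster's Perl script (which is not shown in the thread). The function names `read_fasta_sequence` and `extract_region` and the tiny demo record are made up for the example. It assumes the coordinates shown by NCBI are 1-based and inclusive, and that the sequence in a FASTA file is wrapped across many lines, so the line breaks must be removed before slicing.

```python
import io


def read_fasta_sequence(handle):
    """Return (header, sequence) for the first record in a FASTA stream.

    Line breaks inside the sequence are removed, because chromosome
    FASTA files wrap the sequence over many fixed-width lines.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
        else:
            chunks.append(line)
    return header, "".join(chunks)


def extract_region(sequence, start, end):
    """Return the bases from start to end, 1-based and inclusive (assumed convention)."""
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("coordinates fall outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


# Hypothetical demo record standing in for a real chromosome file.
demo = io.StringIO(">demo_chr\nACGTAC\nGTTGCA\n")
header, seq = read_fasta_sequence(demo)
region = extract_region(seq, 5, 8)
print(header, len(seq), region)
```

For a real file, the `io.StringIO` object would be replaced by `open(path)` on the downloaded chromosome FASTA, and the call would become `extract_region(seq, 34324825, 34333744)`. An off-by-one in the start coordinate, or counting newline characters as bases, are two ways such a script could return a sequence that differs from what NCBI displays.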

QUOTE (WakeDude @ Jul 2 2007, 04:34 PM)
When I pull up that FASTA sequence in NCBI and compare it to the sequence I've pulled via Perl script from the latest RefSeq chromosome FASTA file (same GI:149313536), I see two different sequences.


LOCUS       NC_000083    95272651 bp    DNA    linear    CON 10-JUL-2007
DEFINITION  Mus musculus chromosome 17, reference assembly (C57BL/6J).
ACCESSION   NC_000083
VERSION     NC_000083.5  GI:149313536

They should be the same sequence. I think:
1. More information is needed about how your Perl script works to pull the RefSeq sequence from the chromosome file.
2. How did you compare the two sequences you've got? If you have the Unix version of bl2seq (BLAST 2 Sequences), you should compare them using that program. Also, how different are the lengths of the two sequences?

-cyberpostdoc-
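
As a quick first check before running an alignment, the comparison questions raised in the reply above (are the lengths the same, and where do the sequences diverge?) could be sketched like this. It is a rough illustration in Python with hypothetical function names, not a substitute for an alignment tool. It also tests two possibilities that the thread does not confirm as the cause: that the sequences differ only in letter case, or that one is the reverse complement of the other.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def compare_sequences(a, b):
    """Summarise how two sequences differ, ignoring letter case."""
    a, b = a.upper(), b.upper()
    report = {
        "len_a": len(a),
        "len_b": len(b),
        "identical": a == b,
        "reverse_complement_match": reverse_complement(a) == b,
        "first_mismatch": None,  # 1-based position of the first differing base
    }
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            report["first_mismatch"] = position
            break
    return report


# Hypothetical short sequences used only to exercise the function.
same_but_lowercase = compare_sequences("ACGTTG", "acgttg")
opposite_strand = compare_sequences("ACGTTG", "CAACGT")
one_base_differs = compare_sequences("ACGTTG", "ACGATG")
print(same_but_lowercase["identical"])
print(opposite_strand["reverse_complement_match"])
print(one_base_differs["first_mismatch"])
```

If the lengths differ by one, the coordinates are a likely suspect. If the first mismatch falls at position 1 and the reverse complement matches, the two sources are probably reporting opposite strands. Anything more subtle is better handled with bl2seq, as suggested above.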