How to batch download human genomic sequences for genes? - (Jan/24/2007 )

Hi-

I'd like to download human genomic DNA sequences for a list of ~200 genes, preferably in FASTA format. For each gene I have the NCBI "Official Symbol" (e.g. SHH). What's the best way to do this? Downloading protein and nucleotide sequences is straightforward using the NCBI Batch Entrez utility, but I can't figure out how to do the same for the genomic sequence. Is there a script available that does this query? Thanks for your help!

-oldandbusted-

How much flanking sequence do you want at each end?

Or do you want to start from the TSS?

-cyberpostdoc-

Thanks for your help!

Ideally, I would like the most "complete" sequence possible, including both the coding and noncoding elements. I'm a novice, so please excuse my ignorance, but my understanding is that this is best represented by the NC_ or NT_ sequence. Am I wrong?

-oldandbusted-


Simple answer: yes, you are right. The NCBI annotation on the contigs should be good for the start and end positions of a gene, and those positions can be used to retrieve the genomic sequence.

Complicated answer:
1. It depends on the evidence that NCBI curators used to define the "start" and "end" of a gene. Most genes are delimited by the TSS (transcription start site) and an "end" that might include the polyadenylation signal region. However, if the evidence available at the time of annotation was incomplete, the annotated start and end might not be accurate.
2. Depending on your downstream analysis, you might want to retrieve a certain amount of flanking sequence.
3. If you want to look for transcription factor binding sites (TFBSs) in human, you might want to retrieve 2000 bp upstream of the TSS.
4. If you want to look for common polyadenylation signals, you might want more downstream flanking sequence.
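
One thing that is easy to get wrong with flanking sequence is the strand: "upstream" is 5' of the TSS on the gene's own strand, so for a minus-strand gene it lies at higher chromosome coordinates. Here is a minimal Python sketch of that bookkeeping. The function name and the 1-based inclusive coordinate convention are assumptions made for illustration, not something taken from NCBI or EZRetrieve.

```python
def flanked_region(start, end, strand, upstream=0, downstream=0):
    """Return (region_start, region_end) in 1-based inclusive coordinates.

    start/end are the annotated gene boundaries on the chromosome or contig,
    with start <= end regardless of strand. strand is "+" or "-".
    upstream/downstream are flank sizes in bp relative to the gene's strand.
    """
    if start > end:
        raise ValueError("start must be <= end")
    if strand == "+":
        region_start = start - upstream
        region_end = end + downstream
    elif strand == "-":
        # On the minus strand the TSS sits at the higher coordinate,
        # so upstream sequence extends past `end`.
        region_start = start - downstream
        region_end = end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the first base of the sequence.
    return max(1, region_start), region_end


# Made-up coordinates: 2000 bp of promoter plus 500 bp past the 3' end.
print(flanked_region(10000, 12000, "+", upstream=2000, downstream=500))
print(flanked_region(10000, 12000, "-", upstream=2000, downstream=500))
```

The sketch does not clip at the far end of the chromosome, because that would need the sequence length as an extra input.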

EZRetrieve (http://siriusb.umdnj.edu:18080/EZRetrieve/index.jsp) is a tool that can do this job, but it uses a somewhat old version of the genomic sequence (human build 34). It was developed just before Entrez Gene replaced LocusLink.

It allows you to retrieve by, for example:
  • BC014651: GenBank ID for HOXB6 (homeo box B6, human)
  • 98428: UniGene ID for HOXB6 (homeo box B6, human)
  • 3216: LocusLink ID for HOXB6 (homeo box B6, human)
Since you said you have gene names, you probably already have these IDs. I think the LocusLink ID is what is now used as the GeneID, but I am not sure.

In terms of regions, it allows you to retrieve, for example:
  • From "-200" To "100": upstream 200bp and downstream 100bp relative to the gene start site.
  • From "-4000" To "0": upstream 4000bp relative to the gene start site.
  • From "0" To "4000": downstream 4000bp relative to the gene start site.
  • From "start" To "end": retrieve the whole gene sequence.
In your case, you should use the last option.
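
To make the From/To idea concrete, here is a small Python sketch of how such fields could be turned into a slice of a sequence. It is not EZRetrieve's actual code. The function names, the half-open offset convention, and the assumption that the input already carries `pad` flanking bases on each side in 5'->3' orientation are all hypothetical choices for illustration.

```python
def resolve_window(from_spec, to_spec, gene_length):
    """Turn From/To fields into integer offsets relative to the gene start.

    Offsets are half-open [lo, hi): negative values are upstream of the
    gene start, 0 is the first base of the gene, and gene_length is the
    position just past the last base. "start" and "end" are keywords.
    """
    def to_offset(spec):
        spec = str(spec).strip().lower()
        if spec == "start":
            return 0
        if spec == "end":
            return gene_length
        return int(spec)

    lo, hi = to_offset(from_spec), to_offset(to_spec)
    if lo >= hi:
        raise ValueError("From must be smaller than To")
    return lo, hi


def extract(padded_seq, pad, from_spec, to_spec):
    """Slice a window out of a gene sequence with `pad` flanking bases per side."""
    gene_length = len(padded_seq) - 2 * pad
    lo, hi = resolve_window(from_spec, to_spec, gene_length)
    if lo < -pad or hi > gene_length + pad:
        raise ValueError("window extends beyond the available flanking sequence")
    return padded_seq[pad + lo:pad + hi]


# Toy sequence: lowercase = flank, uppercase = gene.
toy = "aaaa" + "GGGCCC" + "tttt"
print(extract(toy, 4, "start", "end"))  # the whole gene
print(extract(toy, 4, "-2", "3"))       # 2 bp upstream plus the first 3 bp
```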

All that being said, go ahead and play with the tool to see if it fulfils your needs (it supports both single and batch retrieval). Since I built the tool, I know the whole process and can help you retrieve the latest genomic sequences from NCBI.
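
For pulling current sequence straight from NCBI, one possible route is the E-utilities `efetch` endpoint, which accepts `seq_start`, `seq_stop` and `strand` parameters for nucleotide records. The Python sketch below only builds the request URLs. The helper name is made up, and the accession and coordinates in the example are placeholders, not the real position of any gene, so you would need to substitute the NC_ or NT_ accession and the start/end positions from the gene's annotation.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, start, end, strand="+"):
    """Build an efetch URL for a FASTA sub-sequence of a nucleotide record.

    start/end are 1-based positions on the record; strand "-" asks NCBI
    to return the reverse complement (strand=2 in E-utilities terms).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": end,
        "strand": 1 if strand == "+" else 2,
    }
    return EFETCH + "?" + urlencode(params)


# Placeholder table: symbol -> (accession, start, end, strand).
# These values are illustrative only and must come from your own lookup.
regions = {
    "GENE_A": ("NC_000007", 1000, 2000, "-"),
    "GENE_B": ("NC_000007", 5000, 9000, "+"),
}

for symbol, (acc, start, end, strand) in regions.items():
    print(symbol, efetch_fasta_url(acc, start, end, strand))
```

Each URL can then be fetched with `urllib.request.urlopen` and the text appended to one FASTA file. For a list of ~200 genes it is sensible to pause briefly between requests, since NCBI limits how fast a single client may query.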

Let me know if you need any further help.

-cyberpostdoc-