computer readable databases - (Jul/16/2006 )

I have been looking at the influenza virus sequences at GenBank, but they are not very
computer-friendly: the names are spelled inconsistently, the sequences are not aligned,
and it's hard to identify which segments belong to the same virus.

Ideally I would have one big file with one line for each virus, containing
identification data and all 8 segments, aligned with the other viruses,
with just a blank wherever the nucleotide at that position isn't known.

I would also want that database to be easy to update when new viruses
become available.

Does anyone have such a database, or can we share the work of building and
maintaining it, and then also share the data?

-Guenters-

http://www.flu.lanl.gov/

-perlmunky-

QUOTE (perlmunky @ Jul 17 2006, 06:14 AM)


How do I automatically convert their data into one
formatted, computer-readable file?

-Guenters-

QUOTE (Guenters @ Jul 17 2006, 10:25 AM)

Perl, Python, Java, Bash, C, C++; the list goes on. What do you mean by computer-readable? Chances are the files are computer-readable as they are. You could try CSV (comma-separated values), a format of your own, etc.

For a start, you will want to download the data, as parsing it from their web site may be a bit tough...
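
As a rough illustration of the CSV idea, here is a minimal Python sketch that groups downloaded FASTA records by virus and writes one row per virus. It is only a sketch under stated assumptions: the header layout `>ACCESSION|STRAIN|SEGMENT`, the segment labels in `SEGMENTS`, and the helper names are all hypothetical, and real downloads will likely need a different header parser and more careful name matching.

```python
import csv
import io
from collections import defaultdict

# Assumed labels for the 8 influenza A segments; adjust to your data source.
SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "MP", "NS"]


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize_strain(name):
    """Collapse case and spacing differences so spellings of one strain match."""
    return "".join(name.upper().split())


def group_by_virus(text):
    """Map normalized strain name -> {segment label: sequence}."""
    viruses = defaultdict(dict)
    for header, seq in parse_fasta(text):
        # Hypothetical header layout: ACCESSION|STRAIN|SEGMENT
        _accession, strain, segment = header.split("|")
        viruses[normalize_strain(strain)][segment.upper()] = seq
    return dict(viruses)


def to_csv(viruses):
    """Write one CSV row per virus; missing segments become empty fields."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["strain"] + SEGMENTS)
    for strain in sorted(viruses):
        writer.writerow([strain] + [viruses[strain].get(s, "") for s in SEGMENTS])
    return out.getvalue()
```

Calling `to_csv(group_by_virus(fasta_text))` on the downloaded text would then give one line per virus. Note that this only groups the segments; it does not align them.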

-perlmunky-

QUOTE (perlmunky @ Jul 18 2006, 01:17 AM)

What I would like is a file with the aligned H5N1 sequences with all 8 segments,
1 line per virus (14000 bytes), 1 byte per nucleotide, with blanks where no
sequence is available.
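
To make the layout concrete, a fixed-width record of this kind could be sketched in Python as below. This assumes the sequences have already been aligned elsewhere and already use a blank for unknown positions; the function names, the ID field width, and the tiny segment widths in the usage note are hypothetical, chosen only for illustration.

```python
def virus_line(strain, segments, widths, id_width=40, gap=" "):
    """Build one fixed-width record: padded ID, then every segment in order.

    `widths` is an ordered list of (segment label, aligned width) pairs.
    Segments that are missing or shorter than their width are padded with `gap`.
    """
    fields = [strain[:id_width].ljust(id_width)]
    for name, width in widths:
        seq = segments.get(name, "")
        if len(seq) > width:
            raise ValueError(f"{name} is longer than its aligned width {width}")
        fields.append(seq.ljust(width, gap))
    return "".join(fields)


def nucleotide_at(line, widths, segment, position, id_width=40):
    """Look up one aligned position (0-based) directly by column offset."""
    offset = id_width
    for name, width in widths:
        if name == segment:
            return line[offset + position]
        offset += width
    raise KeyError(segment)
```

For example, with `widths = [("HA", 6), ("NA", 4)]` and `id_width=24`, a virus that has only a 4-base HA sequence yields a 34-character line in which the NA columns are all blank. Because every line has the same length, the same column always refers to the same aligned position in every virus.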

And updates once new sequences become available.

When you download from GenBank, it's hard to identify which genes
belong to the same virus.

Are such databases made available? Is there a standard format?
Or does everyone have to create their own in their own format?

-Guenters-