
pairwise alignment results explanation please - (Oct/06/2011 )

Hello Everybody,

I hope everyone is doing well!

I had to do a pairwise alignment of my tomato UBC gene with an Arabidopsis UBC gene. The website I used to do the alignment is http://www.ebi.ac.uk...nucleotide.html.
The FASTA sequences for the two genes are as follows:
Tomato UBC

ctcttcttcc atttctttca aaattaaagt attgttactc tgctattggc tcaaaacctc
tgcaatctcc gtctccttca atttcaactc aagcaaatcc acctctttca ctagtttcat
cactttcaga tcagggtttg gagttgaagg tacggggggc taattgatgg cgtcgaagag
gatattgaag gagctcaagg atctgcagaa ggatcccccc acatcatgca gtgctggtcc
agtggcagag gatatgttcc attggcaagc aacaatcatg gggcctaccg atagccctta
tgctggaggt gtatttttgg tttcaattca tttccctcca gattatcctt ttaagcctcc
aaaggttgcc ttcagaacta aggttttcca tcccaacatc aacagcaatg gaagtatttg
tctggatatt cttaaggagc agtggagtcc agcattaacc atatccaagg tcctgctgtc
catctgctct ctgttgacag acccaaaccc agatgatcct cttgtacctg aaattgctca
catgtacaag actgacaggg ccaaatacga aaccactgct cgtagctgga ctcagaaata
tgcaatggga tgatgcgcaa aatgtctcca ggcatgtctg ggactttgta acagcaatgt
cttatgtgct tggggtgaat gaataaattc cgtgaaagaa cttagttact tcttaatctc
ccttcatgag ggttgttaag ggaacagctg ttttcaattt gtgaatattt atttgatgac
tagtaaggga gaaactgcaa tgtaattcta ctttgtttgc cagtt

Arabidopsis UBC

GGAAATAGTTTGGTGATTTCTCGTAAAGATGTTTAAGAAAATGGATAAAAAAGCAGCGCAGAGAATTGCGATGGAATACAGAGCTATGATCTCGAAAGAA
TCTTTGTTCAGTATTGGTCAAAACTCGAACAATATATACGAATGGACTGCAGTGATCCGGGGTCCAGATGGCACTCCCTATGAAGGTGGCATGTTTAATC
TCAGTATTAAGTTTCCTACGGATTATCCTTTTAAACCACCCAAGTTTACGTTTAAAACTCCGATTTACCATCCAAATATCAATGATGAAGGATCGATTTG
CATGAACATTCTTAAAGACAAATGGACTCCTGCTCTTATGGTTGAAAAGGTGCTTCTGTCAATACTTTTACTATTGGAAAAACCAAACCCAGATGATCCT
CTTGTACCTGAAATTGGACAGCTCTTCAAGAATAACAGATTCCAATTTGACCAGAGAGCTCGAGAATTCACTGCTCGACATGCTAATAATTAAAATTTAT
AAAATTATTTATCTTACTTTCGAAGTTTGTCATATCGTATTTATTATACATAAACAGCTTCCTATCCTATGCTATTGTCGACATCTTTTCTATTATAAAT
AAAAGTCACATTCTTCGATTA

The results of the alignment are displayed below as a text file attachment:

Attached File

My problem is that when I look at the results I cannot make sense of them. How do I read the alignment, and what do they mean by similarity and identity? What can I deduce from this result? Please help me so I know how to do the rest of my genes and make sense of my work.

Thanks to all of you,
Yasamino

-yasamino-

After looking at your result, I think you have gotten as much out of it as possible. These genes don't appear to be that closely related by nucleotide sequence and there are far too many gaps and short stretches to give you any information. What you need is a protein alignment, but I tried looking for coding sequences within your sequences and I'm a bit confused. Each frame of both genes has at least a few stop codons in it, so are the sequences provided actually coding sequences? mRNA sequences? genomic sequences? It's important to know these things before you try to interpret any alignment.
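The frame check described above can be sketched in a few lines of Python. This is only an illustration, not what the EBI tool does: it assumes the standard genetic code, looks at the forward strand only, and the helper names (`clean`, `translate`, `longest_orf`) are made up for this example. The toy input is a short invented fragment, not either of the posted genes.

```python
# Build the standard genetic code table (assumption: nuclear, standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def clean(seq):
    """Uppercase, drop spaces/digits/newlines, and treat U as T."""
    return "".join(ch for ch in seq.upper() if ch.isalpha()).replace("U", "T")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def longest_orf(seq):
    """Return (start_index, protein) of the longest ATG...stop ORF.

    Forward strand only. Returns (-1, "") if no complete ORF is found.
    """
    dna = clean(seq)
    best_start, best_protein = -1, ""
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        protein = translate(dna[start:])
        stop = protein.find("*")
        if stop == -1:
            continue  # no in-frame stop codon, so not a complete ORF
        if stop > len(best_protein):
            best_start, best_protein = start, protein[:stop]
    return best_start, best_protein


# Toy fragment: two leading bases, then ATG GCG TCG AAG TGA.
print(longest_orf("ccATGGCGTCGAAGTGAtt"))  # (2, 'MASK')
```

If a sequence is an mRNA or full-length cDNA, the longest ORF usually sits between the UTRs, which is why stop codons show up in every frame of the full sequence.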

As far as similarity and identity go, for a nucleotide alignment there is no difference unless your sequence has ambiguous characters like Y, H, W, N, etc. For protein alignments, identity means exactly the same amino acid in a position, but similarity can mean amino acids with similar properties:

Serine and Threonine: Both hydrophilic, hydroxyl
Isoleucine and Leucine: Both hydrophobic, aliphatic
Aspartic acid and Glutamic acid: Both hydrophilic, acidic
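
As a rough illustration of the difference, here is a minimal Python sketch that counts identity and similarity over two already-aligned protein strings. The similarity groups below are a simplification assumed for this example; alignment programs usually decide similarity from a substitution matrix such as BLOSUM62, so their numbers will differ. The function name and the toy strings are hypothetical.

```python
# Simplified property groups (an assumption for illustration only).
SIMILAR_GROUPS = [
    set("ST"),    # hydroxyl
    set("ILVM"),  # hydrophobic, aliphatic
    set("DE"),    # acidic
    set("KR"),    # basic
    set("FYW"),   # aromatic
    set("NQ"),    # amide
]


def identity_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over the full alignment length.

    Both inputs must be the same length, with '-' marking gaps.
    Gap columns count toward the length but never as identical or similar.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal length")
    identical = similar = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned_a)
    return 100 * identical / length, 100 * similar / length


# Toy alignment: 2 identical columns, 4 more similar columns, 1 gap column.
ident, simil = identity_similarity("MSTDL-K", "MTSEIAK")
print(round(ident, 1), round(simil, 1))  # 28.6 85.7
```

Note that similarity is always at least as high as identity, because every identical position is also counted as similar.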

When you get a protein alignment, repost and people can take a look.

Best of Luck.

-allynspear-

Hey allynspear,
Thanks for your reply.

To answer your question: the tomato UBC sequence is the complete CDS, which I got from NCBI GenBank (825 bp), and the Arabidopsis UBC gene is the full-length cDNA, which I got from the TAIR website (621 bp).

I did a protein alignment as you asked, using the translated sequences, and it's attached as a text file.

Attached File

I am still confused as to why genomic, cDNA, mRNA, and CDS alignments would differ. How would the protein alignment help?
Thanks again for your help,
Yasamino

-yasamino-

Okay, the protein alignment looks better, mostly because you don't have hundreds of gaps and you have significant stretches of identity/similarity. I suppose I should have first asked what exactly you are trying to do with this alignment, but in the meantime, I can answer some of your questions.

First, genomic alignments can be very confusing because there are all kinds of non-coding sequences (introns, regulatory sequences, etc.) that don't affect the function of your protein. cDNA and mRNA sequences should be the same thing, but again, they contain 5' and 3' UTRs, which can affect the regulation of expression but don't affect the protein sequence at all.

The big difference here is between DNA/RNA and protein. In DNA/RNA sequences, you only have 4 bases to choose from, so a single site could mutate A->T->A and it would look just like the original sequence, even though it had undergone 2 rounds of mutation. With only 4 options, this has a higher probability of happening than in proteins, which have 20 different amino acids. This means that for very closely related genes, nucleic acid alignments may give you usable information, but the farther apart you are evolutionarily, the less information you can get from a nucleic acid alignment.

But the bigger issue is that nucleic acid similarity by itself tells you very little about protein function, since multiple different codons can code for the same amino acid, while some single base changes give a dramatically different amino acid. You can have VERY, VERY different coding sequences that code for very similar proteins because of this fact. This all comes down to the fact that if two proteins need to perform similar functions, it is the protein sequence and not the nucleic acid sequence that is conserved. If you have two highly divergent sequences, you will most likely be able to pull usable information from the protein alignment when looking for functional or structural domains.
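
To make the codon point concrete, here is a small Python sketch. It assumes the standard genetic code, and the two 12-base "genes" are invented for illustration; they are not taken from the tomato or Arabidopsis sequences. The names `translate` and `percent_identical` are hypothetical helpers.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons starting at position 0."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identical(a, b):
    """Percent of positions that match between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equal length")
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up coding sequences that use different codons for Leu, Ser and Arg.
gene_a = "ATGCTTTCTCGT"
gene_b = "ATGTTAAGCAGA"

print(translate(gene_a), translate(gene_b))         # MLSR MLSR
print(round(percent_identical(gene_a, gene_b), 1))  # 41.7

# A single base change can also swap in a very different amino acid:
print(translate("GAA"), translate("GTA"))           # E V
```

The two toy genes are identical at only 5 of 12 nucleotide positions yet encode exactly the same peptide, while one base change turns an acidic glutamate into a hydrophobic valine.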

If you can provide more information as to what you are interested in learning from the alignment, I will do my best to help.

Best of Luck.

-allynspear-

Hi again,

I sort of understand things better now. In my case, my professor asked me to align my tomato gene with all the Arabidopsis UBC genes at the DNA level and the phylogeny level to figure out which ones are most closely related to it. Once we figure them out, we have to look at the motifs and domains associated with those genes, and any literature pertaining to them, to help us find ways to study our gene better.

This is awesome though, thanks for your time.
Yasamino

-yasamino-
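
For the one-against-many comparison described in the last post, a first pass could look something like the Python sketch below. It is a bare-bones Needleman-Wunsch global alignment with a flat match/mismatch/gap score, which is an assumption made for brevity; real tools use substitution matrices and affine gap penalties, and a proper phylogeny needs a multiple alignment and tree-building software. The candidate names and sequences are hypothetical placeholders.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a simple linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical non-gap columns as a percentage of alignment length."""
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100 * matches / len(aligned_a)


def rank_by_identity(query, candidates):
    """Align the query to each candidate and sort by identity, highest first."""
    results = []
    for name, seq in candidates.items():
        aligned_q, aligned_s = global_align(query, seq)
        results.append((name, percent_identity(aligned_q, aligned_s)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical protein fragments standing in for a query and three candidates.
candidates = {"UBC_B": "MASRRILREL", "UBC_C": "GGGGGGGGGG", "UBC_A": "MASKRILKEL"}
for name, identity in rank_by_identity("MASKRILKEL", candidates):
    print(name, identity)
```

Following the advice earlier in the thread, this kind of ranking is more informative on translated protein sequences than on the raw nucleotide sequences when the species are as far apart as tomato and Arabidopsis.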