
formal name for a type of repeat - (May/11/2020)

I have been simulating artificial DNA sequences to understand how certain sequence characteristics influence the mutual information computed between two positions in a sequence across a large sample of sequences. Mutual information is essentially a measure of association, like a correlation, that is appropriate for symbolic sequences rather than numeric values. One type of simulated sequence requires embedding short subsequences like NXYNZWNJK, where N is always the same nucleotide but X, Y, Z, W, J, K are free to vary across instances of this short sequence. Another way to describe this is to say that the initial nucleotide repeats at multiples of 3 bp: positions 1, 4, 7, 10, etc. This is different from a true trinucleotide repeat such as NXYNXYNXY, in which the whole trinucleotide repeats a variable number of times.
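A minimal sketch of this kind of simulation, for illustration only. It assumes that the repeating nucleotide N is drawn uniformly at random for each simulated sequence (it is constant within an instance but differs between instances), that the free positions are uniform over A, C, G, T, and that mutual information is estimated with the simple plug-in formula. The names `hollow_repeat` and `mutual_information` are hypothetical, not taken from any library.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"

def hollow_repeat(n_units, period=3, rng=random):
    """One instance: the same nucleotide every `period` bp, free positions elsewhere."""
    anchor = rng.choice(ALPHABET)  # assumption: N varies between instances
    return "".join(
        anchor if i % period == 0 else rng.choice(ALPHABET)
        for i in range(n_units * period)
    )

def mutual_information(seqs, i, j):
    """Plug-in estimate of mutual information (in bits) between columns i and j."""
    n = len(seqs)
    pair = Counter((s[i], s[j]) for s in seqs)
    col_i = Counter(s[i] for s in seqs)
    col_j = Counter(s[j] for s in seqs)
    return sum(
        (c / n) * math.log2(c * n / (col_i[a] * col_j[b]))
        for (a, b), c in pair.items()
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
seqs = [hollow_repeat(3, rng=rng) for _ in range(5000)]  # NXYNZWNJK-style instances

mi_anchor = mutual_information(seqs, 0, 3)  # two repeating positions (1 and 4, 1-based)
mi_free = mutual_information(seqs, 1, 2)    # two free positions
print(f"MI(anchor, anchor) = {mi_anchor:.3f} bits")
print(f"MI(free, free)     = {mi_free:.3f} bits")
```

Under these assumptions the two repeating positions carry close to the maximum of 2 bits of mutual information, while two free positions give a value near zero (slightly positive because of the small-sample bias of the plug-in estimator).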

 

My question is this: Is there a formal name for this kind of repeat structure? As I said above, I can't refer to it as a true trinucleotide repeat, but I am wondering if this kind of repeating nucleotide with interleaved nucleotides that are free to vary has a formal name in the literature. I have been referring to them in the paper that I am writing as "hollow" trinucleotide repeats in the sense that the structure of a true trinucleotide repeat has been "hollowed" out leaving only the initial nucleotide repeating with the other positions basically serving as placeholders, but I don't want to invent a new term if there is already a formal way of referring to this kind of repeat. This is especially true if I am overlooking some aspect of this sequence characteristic that goes by a different name.

 

Any help would be appreciated.

-dannemil-

The term invariant fits the description of what you want. Here's an example in a scientific paper.

 

The only time you might find such a situation in real life is where you have a string of amino acids whose codons can only start with one particular nucleotide. If you refer to a genetic code table, you will see that Phenylalanine (F) and Tyrosine (Y) are two amino acids whose codons only ever start with T/U in the first position. So, assuming no mutations, a string such as:

Y   Y   Y   Y   Y   Y   Y   Y   F   F   F   F   F   F
Tyr Tyr Tyr Tyr Tyr Tyr Tyr Tyr Phe Phe Phe Phe Phe Phe

could have the DNA sequence:

TAT TAC TAT TAC TAT TAC TAT TAC TTC TTT TTC TTT TTC TTT
Tyr Tyr Tyr Tyr Tyr Tyr Tyr Tyr Phe Phe Phe Phe Phe Phe

Here every first codon position is a T; these would be the invariant nucleotides in this case. In real life you are unlikely to find such a case empirically, and if you do, it is most likely to be a single base whose codon can only ever encode one amino acid for some reason.
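As a quick check of the example above, the DNA string can be split into codons and each codon position inspected. This is only a sketch: the `CODONS` table is deliberately limited to the four standard-code codons for Tyr and Phe, and the variable names are illustrative.

```python
# Standard genetic code, restricted to the two amino acids in the example (DNA alphabet).
CODONS = {"TAT": "Tyr", "TAC": "Tyr", "TTT": "Phe", "TTC": "Phe"}

dna = "TAT TAC TAT TAC TAT TAC TAT TAC TTC TTT TTC TTT TTC TTT".replace(" ", "")

# Split into non-overlapping triplets and translate.
codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
protein = [CODONS[c] for c in codons]

# Collect the nucleotides seen at each codon position.
first_positions = {c[0] for c in codons}
second_positions = {c[1] for c in codons}
third_positions = {c[2] for c in codons}

print(" ".join(protein))
print("1st position:", sorted(first_positions))
print("2nd position:", sorted(second_positions))
print("3rd position:", sorted(third_positions))
```

Only the first codon position comes out with a single nucleotide (T); the second and third positions vary, which is the invariant-with-free-positions pattern being discussed.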

 

 

 

-bob1-

Thank you very much for the reference and the explanation. I should have mentioned, but neglected to, that I'm trying to understand some sequence characteristics of promoter regions upstream of the TSS, so it doesn't only involve trinucleotides. In particular, in many promoters there appears to be a slight bias in favor of the nucleotide at position x + 6 being the same as the nucleotide at position x, with x + 6 being 3' to x, at least over some parts of the sequence.

Something like N-U-W-X-Y-Z-N-Z-X-W-U-Y-N, where U, W, X, Y, Z are free to vary, but there is a bias for the nucleotides at x, x+6, x+12 to be the same.
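One simple way to quantify a bias like this is to count how often the nucleotide at x matches the one at x + 6 and compare that with the match rate expected from base composition alone. The sketch below assumes the sequences are plain strings over A, C, G, T; `lag_match_rate` and `expected_match_rate` are hypothetical helper names, and the toy input is made up for illustration rather than real promoter data.

```python
from collections import Counter

def lag_match_rate(seqs, lag=6):
    """Fraction of positions x (pooled over sequences) where s[x] == s[x + lag]."""
    matches = total = 0
    for s in seqs:
        for x in range(len(s) - lag):
            matches += s[x] == s[x + lag]
            total += 1
    return matches / total if total else float("nan")

def expected_match_rate(seqs):
    """Match rate expected if positions were independent: sum of squared base frequencies."""
    counts = Counter("".join(seqs))
    n = sum(counts.values())
    return sum((c / n) ** 2 for c in counts.values())

# Toy input only; real promoter sequences would be loaded here instead.
toy = ["ACGTACGTACGT", "AAAAAAAAAAAA"]
observed = lag_match_rate(toy, lag=6)
expected = expected_match_rate(toy)
print(f"observed = {observed:.3f}, expected under independence = {expected:.3f}")
```

An observed rate above the expected one would be consistent with a period-6 bias, although a proper test would also need to account for local composition and for the number of positions compared.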

 

Thanks again.

-dannemil-