Many problems with sequencing results cannot be recognized by viewing the text file alone. Therefore, the quality of your sequence should always be evaluated by studying the chromatogram to identify problem data and to check the basecalls made by the analysis software.
The general data quality should be determined first. Begin by opening the chromatogram file using one of the programs Editview (Mac), Chromas (PC) or BioEdit (PC). Below is a chromatogram displaying good quality sequence.
Peaks should be evenly spaced with minimal background noise. Good sequencing reactions can be expected to yield read lengths of about 600 bases.
Signal strength indicates how well the sequencing reaction ran. These values are found on the annotation page of your chromatogram, which contains information on the gel run. A sample annotation page is shown here. An average signal strength is given for each of the four dyes and is listed under the heading "Ave. Signal Intensity:".
ABI recommends a signal strength value of 200 for each of the four dyes. Substantially higher signal strength values usually indicate excess template in the reaction. Values between 50 and 100 are low but acceptable, and good data can still be obtained. Signal strengths for these templates can sometimes be improved by increasing the amount of DNA added.
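The ranges above can be written down as a small helper. This is a minimal Python sketch and not part of the ABI software: the function name classify_signal, the category labels, and especially the excess_cutoff value are assumptions made for illustration, because the text gives no number for "substantially higher" and does not describe values below 50 or between 100 and 200.

```python
def classify_signal(avg_intensity, excess_cutoff=1000):
    """Classify one dye's average signal intensity.

    The 50-100 range and the recommended value of 200 come from the text.
    excess_cutoff is a hypothetical placeholder, not an ABI value.
    """
    if avg_intensity >= excess_cutoff:
        return "high: possible excess template"
    if avg_intensity >= 200:
        return "at or above recommended value"
    if avg_intensity > 100:
        return "below recommended value"
    if avg_intensity >= 50:
        return "low but acceptable"
    return "very low"


if __name__ == "__main__":
    # Made-up example values, one per dye, as read from an annotation page.
    example = {"G": 210, "A": 85, "T": 140, "C": 35}
    for dye, value in example.items():
        print(dye, value, classify_signal(value))
```

The cutoff is left as a parameter so that it can be set to whatever value your own facility treats as excessive.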
Low signal strengths are problematic due to increased background noise. Background noise results in a higher percentage of N's and incorrect base calls. Notice the "noisy" nature of the following chromatogram of poor data.
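As a rough first check on a text file, the percentage of N's can be counted before the chromatogram is opened. The sketch below is only an illustration: the function name n_fraction is hypothetical, and no cutoff for "too many" N's is implied, because the text does not give one.

```python
def n_fraction(sequence):
    """Return the fraction of basecalls in `sequence` that are N."""
    calls = sequence.upper()
    if not calls:
        return 0.0
    return calls.count("N") / len(calls)


if __name__ == "__main__":
    # Made-up read fragment with two ambiguous calls out of eight.
    print(n_fraction("ACGTNNAC"))
```

A high fraction only says that the software struggled; the chromatogram is still needed to see why.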
Now that you have determined that you have good sequencing data, you are ready to edit your sequence. Base miscalls by the analysis software are common and should be expected. Common problems include miscalls in homopolymer regions, insertions or deletions of bases, and single-base miscalls. The following are examples of these and other common basecalling problems.
The insertion of an extra base in the sequence is common near the end of the gel run. As the gel resolution deteriorates, the peaks broaden. The analysis software uses a set value called base spacing to locate peaks in the chromatogram. The base spacing is optimal for the middle region of the gel, where gel resolution is best, but is poorly suited to the end (or beginning) of the gel, where resolution is poor. This can lead to a single peak being assigned as two bases by the analysis software. An example is shown in the following chromatogram between bases 640 and 650. The A at position 645 is an extra base assigned to the A peak. The same is true for the T at position 648. (Also notice that the G directly under the A at position 645 has been missed! Deletions are discussed next.)
The exclusion of a base in the sequence is common near the beginning of the gel run, but is also found throughout the sequence. Gel resolution is poor at the beginning of the sequence, with peaks sometimes overlapping. Because the analysis software uses the base spacing to look for peaks at set intervals, a peak can be missed. Observe the missing A after the G at base 14. There are two distinct green A peaks, but the analysis software has called only one.
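Because an extra base of this kind appears in the text file as the same base called twice in a row, late repeated calls can be listed as places to inspect in the chromatogram. This is a hedged Python sketch: the function name flag_repeated_calls and the default start of 600 (taken from the typical read length mentioned earlier) are assumptions, and a genuine doubled base looks identical in the text, so the list is a set of candidates and not a set of errors.

```python
def flag_repeated_calls(sequence, start=600):
    """Return 1-based positions after `start` whose call repeats the previous base.

    N calls are ignored. The result marks places to inspect by eye only.
    """
    calls = sequence.upper()
    flagged = []
    for i in range(max(start, 1), len(calls)):
        if calls[i] == calls[i - 1] and calls[i] != "N":
            flagged.append(i + 1)
    return flagged


if __name__ == "__main__":
    # Made-up short read; start=0 checks the whole sequence.
    print(flag_repeated_calls("ACGTACGTAAC", start=0))
```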
A common base miscall is a G that follows an A. The rate of incorporation of G's after A's by the enzyme is low, which produces a weak G peak. Compare the signal intensity of the G's at bases 372 and 375 with that of the G at base 391.
The G's after A's in this sequence are weak but not miscalled by the software. The G, however, can be so weak that the software is unable to assign the base. The G's at positions 389 and 391 were each assigned as an N because they were too weak.
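Since a very weak G after an A can end up as an N, the N's that directly follow an A are worth checking first when editing. The following Python sketch lists them; the function name candidate_weak_g is hypothetical, and the positions it returns are only candidates, since an N after an A can have other causes.

```python
def candidate_weak_g(sequence):
    """Return 1-based positions of N calls that directly follow an A."""
    calls = sequence.upper()
    return [
        i + 1
        for i in range(1, len(calls))
        if calls[i] == "N" and calls[i - 1] == "A"
    ]


if __name__ == "__main__":
    # Made-up fragment: the first N follows an A, the last N follows a G.
    print(candidate_weak_g("TTANCCAGN"))
```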
Problems due to sequence context are also observed. These problems are usually associated with GC-rich sequences, which are problematic because of their high melting temperature. The following example is a GC-rich sample that has a compression region between bases 220 and 230. The peaks are broad and resolution is poor.
This problem was solved by increasing the denaturation temperature of the sample prior to loading it on the gel. Notice the peaks are now well defined.
The addition of too much template DNA in a reaction can reduce the read length of your sequence. This is observed as high signal strength in the beginning of the sequence followed by a rapid decline in signal later in the gel. The raw data window displays the signal intensity for the entire gel. The first image shows the results for good sequence that has an average signal intensity for the entire gel. The second image displays the results for a sequence with too much DNA. The initial signal is strong but the signal rapidly decreases.
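The decline described above can be put into a single number by comparing the average raw signal at the start of the run with the average at the end. This is a minimal Python sketch under stated assumptions: the raw data are available as a plain list of intensities, the function name late_to_early_ratio and the window parameter are hypothetical, and the text gives no cutoff for what counts as a rapid decline.

```python
def late_to_early_ratio(raw_signal, window):
    """Return mean of the last `window` points divided by mean of the first `window`.

    A ratio well below 1 means the signal has dropped over the run.
    """
    if window <= 0 or len(raw_signal) < 2 * window:
        raise ValueError("need at least 2 * window data points")
    early = sum(raw_signal[:window]) / window
    late = sum(raw_signal[-window:]) / window
    if early == 0:
        raise ValueError("early signal is zero; ratio is undefined")
    return late / early


if __name__ == "__main__":
    # Made-up trace: strong start followed by a much weaker end.
    trace = [1000] * 10 + [100] * 10
    print(late_to_early_ratio(trace, window=10))
```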
The presence of unincorporated fluorescently labelled dideoxyterminators on the sequencing gel is due to insufficient clean-up of the sequencing reaction. The resulting dye blobs, so called for their blob-like appearance on the gel, will interfere with the analysis of the sequence. Strong dye blobs can cause the loss of the first 80 to 100 bases of the sequence. Adjacent lanes will also be affected, causing loss of data in those lanes.
DNA with high GC-content can be difficult to sequence. The sequence will start out strong but the signal strength will rapidly decrease until there is no sequence data. Therefore, read lengths are typically shorter for these templates.
A template does not have to be GC-rich overall to have problems with secondary structure; short regions of high GC-content can cause them. These regions can form secondary structure that the enzyme is unable to melt and process through. The secondary structure is observed as an abrupt stop in the sequence.
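When the template sequence is already known, short GC-rich stretches can be located ahead of time with a sliding window. The Python sketch below is an illustration only: the function name gc_rich_windows, the window size of 20 and the threshold of 0.8 are assumed values that do not come from the text, and a high GC fraction does not by itself prove that secondary structure will form.

```python
def gc_rich_windows(sequence, window=20, threshold=0.8):
    """Return 1-based start positions of windows whose GC fraction >= threshold."""
    calls = sequence.upper()
    starts = []
    for i in range(len(calls) - window + 1):
        segment = calls[i:i + window]
        gc = (segment.count("G") + segment.count("C")) / window
        if gc >= threshold:
            starts.append(i + 1)
    return starts


if __name__ == "__main__":
    # Made-up template: a 20-base GC block flanked by AT-rich sequence.
    template = "AT" * 10 + "GC" * 10 + "AT" * 10
    print(gc_rich_windows(template, window=20, threshold=1.0))
```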
Poly A tails are difficult for the enzyme to process through. A "stutter" effect is observed in the sequence directly downstream of the poly A region. This effect is caused by the dissociation and reassociation of the enzyme with the template as it processes through the poly A region. A wave-like pattern in all four dyes will be observed. Notice the increased number of N's directly following the poly A region.
Other homopolymeric regions, such as the following run of G's, also cause problems for the polymerase. In this example, the enzyme is unable to process through the G's and dissociates from the template.
Repetitive regions are difficult for the enzyme to process through without dissociating from the template. Usually, as observed in the example below, the signal decreases to the point at which no further sequence can be obtained.
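Homopolymer runs such as a poly A tail or a run of G's are easy to locate in the text of a read or of a known template. This is a minimal Python sketch: the function name homopolymer_runs and the default minimum length of 8 are assumptions chosen for illustration, since the text does not say how long a run must be before it causes trouble.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=8):
    """Return (base, 1-based start, length) for each run of at least min_length.

    Runs of N are skipped because they are not real homopolymers.
    """
    runs = []
    position = 1
    for base, group in groupby(sequence.upper()):
        length = len(list(group))
        if length >= min_length and base != "N":
            runs.append((base, position, length))
        position += length
    return runs


if __name__ == "__main__":
    # Made-up fragment containing a run of eight G's.
    print(homopolymer_runs("AC" + "G" * 8 + "TA"))
```

Sequence directly downstream of a reported run is where stutter or loss of signal would be expected, so that is the region to examine in the chromatogram.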
Of course, not all sequencing and editing problems are shown here. These are a few of the more common problems that arise when sequencing. Other resources that discuss optimizing your sequencing reaction are available on the web. A good place to begin is the QIAGEN Guide to Template Purification and DNA Sequencing, which includes a good discussion of optimal conditions and the effect of contaminants on sequencing quality.
Questions on sequencing reactions or results can also be directed to the DNA Core staff.
These pages are maintained by Joe Forrester (Updated: 12/08/99)
Send comments to: (firstname.lastname@example.org)