Created: 3rd January 1999, last updated: 4th January 1999, © 1999 ABRF


An Analysis Of Techniques Used To Improve The Accuracy Of Automated DNA Sequencing Of A GC-rich Template: Results From The 2nd ABRF DNA Sequence Research Group Study

 

Pamela Scott Adams1, Mary Kay Dolejsi2, George Grills3, Doug McMinimy4, Paul Morrison5, John Rush6, Stephen Goff7, Maureen Milnamow7, Allen Morgan7, Max A. Karlovitz8, C. Ellison Dial8 and Susan H. Hardin9

1Trudeau Institute, Saranac Lake, NY; 2Fred Hutchinson Cancer Research Center, Seattle, WA; 3Albert Einstein College of Medicine, Bronx, NY; 4The Jackson Laboratory, Bar Harbor, ME; 5Dana-Farber Cancer Institute, Boston, MA; 6Howard Hughes Medical Institute, Harvard Medical School, Boston, MA; 7Novartis Biotechnology, Research Triangle Park, NC; 8Daniel H. Wagner Associates, Inc., Malvern, PA; 9University of Houston, Houston, TX


A study was conducted by the Association of Biomolecular Resource Facilities' (ABRF) DNA Sequence Research Group to determine the efficacy of several strategies that are often recommended as a means to improve the reliability of DNA sequence information obtained from templates with high guanine and cytosine (GC) content. The strategies investigated included: the addition of a denaturing co-solvent (dimethylsulfoxide, DMSO), alteration of the thermocycling temperature profile, and manual post-analysis sequence editing. The data was analyzed by both Perkin Elmer/Applied Biosystems Division (PE/ABD) and phred basecalling programs and then evaluated by several different methods. The study showed that the addition of DMSO can improve the number of correct basecalls and the number of bases with high confidence levels, especially if the sequence obtained using the manufacturer's recommended conditions was of poor quality. Altered thermocycling temperature protocols can be helpful, but the impact is not as reproducible as the effect of DMSO. Manual editing produced significant improvements in the number of correct basecalls for this GC-rich template. Most of the data was produced using dye-terminator chemistry with AmpliTaq DNA polymerase, FS, on PE/ABD automated DNA sequencers.

 

Introduction

This paper describes the results from the ABRF DNA Sequence Research Group's (DSRG) second annual study, conducted in 1997. The first study, conducted in 1996, was designed to assess the ability of laboratories to sequence a moderately difficult, double-stranded plasmid DNA sample containing a GC-rich insert. In the 1996 study, data was analyzed to determine the effectiveness of protocols developed for GC-rich samples, the effects of changes in sequencing chemistry, the differences in performance due to differences in sequencing hardware and types of products used, and the effects of manual editing (1, 2). This first study showed that gel length, manual data review, and sequencing chemistry were all major factors influencing sequence accuracy and length of read.

A few facilities submitted data to the 1996 study that allowed direct comparison of the effects of either DMSO or manual editing on sample accuracy. Sequences were submitted both with and without DMSO in the sequencing reactions, or using DMSO combined with an altered thermocycling temperature profile. In 3 out of 4 of these cases, DMSO or the altered thermocycling temperature profile increased the length of read of correct sequence. However, many of the top-ranked responses did not use DMSO or an altered thermocycling temperature profile. The first study also found that for 6 direct comparisons, manual editing of this difficult template improved the correct sequence by an average of 157 additional bases. This result was in contrast to an earlier study by the ABRF Nucleic Acids Research Group (3), using different chemistry, that found manual editing did not provide any significant benefit.

These results prompted the DSRG to base its second study on three specific questions concerning GC-rich templates:

1. Does the addition of DMSO consistently improve sequence results?

2. Can altered thermocycling temperature profiles improve sequence results?

3. Does post-analysis, manual editing consistently improve sequence accuracy?

The 1997 data was analyzed by comparing the number of errors over specified intervals using the sequence generated by the instrument manufacturers' basecalling software. It was subsequently analyzed by comparing the quality values that were assigned to each base using the phred basecalling software available from the University of Washington (4, 5, 6). The quality values determined by phred reflect the likelihood of a correct basecall being made for each position in the chromatogram on the basis of the trace characteristics. The use of quality values in assessment of sequence quality has gained increasing acceptance as an alternative to counting errors and is useful for the comparison of raw data generated from many different facilities. Additionally, Wagner Associates, an independent mathematical analysis company, analyzed the 1997 data as part of a collaborative study with the DSRG (7).

The 1997 study demonstrates that the majority of laboratories benefited from the addition of DMSO, as judged by number of correct basecalls or the number of bases with acceptable confidence levels, especially if the initial sequence was poor. The impact of altered thermocycling temperature profiles was much more variable. Manual editing of the sequence data consistently improved the number of correct basecalls.

 

Materials and Methods

Test Sequence Information

The test sample was provided by Ed Laufer, Olivia Orozco, and Cliff Tabin of the Department of Genetics at Harvard Medical School. This is a recombinant plasmid that uses the Bluescript II SK phagemid as a vector and contains the chicken lunatic fringe gene (Genbank Accession #U91849) as a 2,770 base-pair insert. This gene encodes an intracellular signaling molecule involved in many aspects of embryonic development (8).

DNA and Survey Preparation

The plasmid DNA sequencing template was prepared for the study by two rounds of CsCl/ethidium bromide equilibrium centrifugation followed by isopropanol extraction and ethanol precipitation. Twenty microgram samples were distributed into microcentrifuge tubes and dried in a SpeedVac. The DNA sample, a sample survey, a floppy disk and a return envelope addressed to a third party were mailed to 134 ABRF member laboratories that offer DNA sequencing as a service. Participants were asked to obtain sequence using M13 Reverse primer from four different reaction conditions: 1) according to their sequencing machine manufacturer's recommended condition, 2) this condition supplemented with 5% DMSO, 3) this condition using an altered thermocycling temperature profile (AT), and 4) this condition using 5% DMSO and AT. The instructions for reaction processing did not request a particular altered thermocycling temperature profile. In addition, both unedited and edited data for each condition were requested.

Analysis Methods

The first type of analysis performed by the DSRG was an analysis of errors contained within specified ranges. Sequencher (Gene Codes Corporation, Ann Arbor, MI) was used to compare the test sequences to the known sequence. All submissions were trimmed of approximately 50 bases at the 5' end, since we were primarily interested in each laboratory's ability to sequence through GC-rich areas, rather than in how accurate their sequencing was close to the primer. The cumulative number of errors over three equally spaced intervals downstream (3') from the priming site was determined for each sequence. Substitutions (miscalls and ambiguities), insertions and deletions were considered errors. The resulting data was compiled using Microsoft Excel. One-way ANOVA with Dunnett's post test was performed using GraphPad InStat (Version 3.0 for Windows 95, GraphPad Software, San Diego, CA).
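As an illustration of what counting substitutions, insertions and deletions against a known sequence involves, the sketch below computes a unit-cost edit distance between a read and a reference. This is only a stand-in: the source does not describe how Sequencher aligns sequences, and the function name `count_alignment_errors` and the short example sequences are hypothetical.

```python
def count_alignment_errors(read, reference):
    """Minimum number of substitutions, insertions and deletions that
    separate `read` from `reference` (unit-cost edit distance).

    Illustrative assumption: every error type costs 1, and an ambiguity
    code such as 'N' simply counts as a substitution.
    """
    # previous[j] = errors between an empty read and reference[:j]
    previous = list(range(len(reference) + 1))
    for i, read_base in enumerate(read, start=1):
        current = [i]  # i extra bases in the read versus an empty reference
        for j, ref_base in enumerate(reference, start=1):
            substitution = previous[j - 1] + (read_base != ref_base)
            insertion = previous[j] + 1      # extra base present in the read
            deletion = current[j - 1] + 1    # reference base missing from the read
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    reference = "GATTACA"  # hypothetical reference fragment
    print(count_alignment_errors("GATTACA", reference))  # identical read
    print(count_alignment_errors("GANTACA", reference))  # one ambiguity
    print(count_alignment_errors("GATACA", reference))   # one deleted base
```

Restricting the reference and read to the first 200, 400 or 600 bases before calling such a function would give cumulative counts of the kind tabulated in this study.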

To examine the data with a second method, all unedited chromatograms were processed on a SUN Ultrasparc UNIX workstation using the basecalling analysis software phred. Phred examines each fluorescent peak in the PE/ABD-generated chromatogram file and assigns a basecall as well as a 'quality value' to each peak. These quality values are logarithmically related to the estimated probability of an incorrect base assignment. For example, a quality value of 20 corresponds to approximately 1 error in 10^2, or a 99% accurate basecall. A quality value of 30 corresponds to approximately 1 error in 10^3, or a 99.9% accurate basecall. A companion program, phrap, aligns the sequences and provides (among other things) the number of bases at a particular quality, the average base quality of the total sequence and the number of discrepancies (errors) at each quality.
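The relationship between a phred quality value and the estimated error probability can be written as Q = -10 log10(P). The following minimal sketch converts between the two and counts bases above a quality threshold; the function names and the example quality list are illustrative and are not part of phred itself.

```python
import math


def phred_error_probability(quality):
    """Estimated probability that a basecall with this quality is wrong."""
    return 10 ** (-quality / 10)


def phred_quality(error_probability):
    """Quality value corresponding to an estimated error probability."""
    return -10 * math.log10(error_probability)


def count_high_quality(qualities, threshold=20):
    """Number of bases whose quality is strictly greater than the threshold."""
    return sum(1 for q in qualities if q > threshold)


if __name__ == "__main__":
    for q in (20, 30):
        p = phred_error_probability(q)
        print(f"Q{q}: about 1 error in {round(1 / p)} calls")
    # Hypothetical per-base qualities for a short stretch of a read
    print(count_high_quality([8, 15, 21, 34, 40, 20]))
```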

An independent mathematical consulting company also analyzed this data. The machine-generated basecalls in each analysis file (chromatogram) were aligned to the known sequence with software that implements a standard alignment algorithm (9) and error rates were generated from this data. S-Plus version 3.3 for SunOS (MathSoft, Seattle, Washington) was used to compute the p-values.

 

Results

Forty-eight facilities, using 49 different machines, returned 246 sequencing chromatograms. Of these, 230 were appropriately annotated and, therefore, included in this analysis. As the responses arrived, any information revealing the identity of the responder was removed. The results of the sequencing "test" on the template with a 74% GC-rich area were tabulated and are presented below. Preliminary analyses of the data were presented at ABRF '97 in Baltimore, MD (10), prepared as an electronic poster on the web (11) and presented at the Ninth Annual International Genome and Sequencing Analysis Conference at Hilton Head, SC (12).

Summary Of Instruments And Parameters Used By The Study Participants

Ninety-six percent of respondents used a PE/ABD Model 373 or 377 machine. There were two submissions by laboratories using LI-COR machines (one Model 4000 and one Model 4000L) and one from a facility using a PE/ABD Model 310. The numbers, types and well to read (WTR) lengths of the automated DNA sequencers participating in this study are shown in Figure 1.

Figure 1. Participation. The percentages, types and WTR (well to read) lengths of automated DNA sequencers used in this study. Instruments included PE/ABD models 310, 373A, 373S and 377 machines, and LI-COR models 4000 and 4000L machines. The total number of instruments was 49.

Eighty-nine percent of the respondents used dye-terminator sequencing chemistry, 96% used AmpliTaq DNA polymerase, FS, enzyme and 81% reported using PE/ABD Prism Ready Reaction kits. Sixty-eight percent used 20 microliter ("whole") reactions, while 20% had switched to half reactions (10 microliter volumes) and one laboratory reported using quarter reactions (5 microliter volumes). Seventy-two percent used some type of spin column to eliminate the unincorporated dye-terminators from the sequencing reaction. Of those laboratories using columns for cleanup, 24% reported that they routinely reuse the columns.

Analysis of DMSO and Altered Thermocycling Temperature Profile Effects: Individually and Combined

The effects of DMSO and altered thermocycling temperature profiles were examined in a variety of ways. Initially, the original basecalls generated by the machine (unedited sequences) were analyzed for errors using Sequencher. The number of errors over cumulative 200-base ranges was compared for each condition that had been requested: Manufacturer's recommended conditions (Standard), 5% DMSO, Altered Thermocycling temperature conditions (AT) or a combination of both (DMSO+AT). For this analysis the results from the 373A with a 24 cm WTR were omitted from the 0-600 range because these machines rarely provide accurate data in the 400-600 base range. When the average number of errors for each condition over these ranges was calculated and plotted, the effects of the alternative reaction conditions could be seen (Figure 2).

Figure 2. Effect of reaction conditions on sequence accuracy. Analysis of unedited sequences, showing the effects of adding DMSO, using altered thermocycling temperature (AT) conditions or a combination of both (DMSO+AT). The average and the standard error of the mean (SEM) for the number of errors are plotted. The number of data points (n) for each group is indicated on the x-axis below the condition. The values for the 373A with a WTR of 24 cm were omitted from the 0-600 range because these machines rarely give accurate sequence in the 400-600 base range.

Both conditions where DMSO was present (DMSO and DMSO+AT) showed a lower number of errors over all the ranges when compared to standard sequencing conditions. However, only the data from the DMSO-containing reactions in the 0-400 and 0-600 ranges were significantly different (at the p<0.05 level) from the standard conditions. The altered thermocycling temperature condition did not appear to afford any advantage alone or in combination with DMSO when analyzed by averaging the number of cumulative errors.

The effects of these conditions on the quality of sequence data were also examined using the basecalling software phred. Table 1 summarizes these results. It lists the average number of bases with a quality >20, the average base quality, and the number of errors made in base calls with a quality >20, for the same four conditions described above. Again the results from the 373A were omitted from this analysis because they would disproportionately weight the results from the other automated sequencers. As can be seen, the DMSO-containing groups (DMSO and DMSO+AT) had more bases with a quality >20, a higher overall average quality and fewer errors in basecalling by the phred algorithm. Using base quality as a measure, there also appears to be a benefit from the altered thermocycling temperature condition. However suggestive these results appear, none of the differences are statistically significant.

Table 1. PHRED/PHRAP analysis of conditions.

             No. of bases Quality>20    Base Quality
Condition    Avg.      SEM              Avg.     SEM     Errors    N
Standard     260.0     21.9             10.9     0.9     8         30
DMSO         311.4     23.4             12.5     9.0     1         25
AT           302.3     23.6             12.4     0.9     1         20
DMSO + AT    326.6     22.9             12.9     1.0     3         14

The effects of DMSO, Altered Thermocycling conditions (AT), or a combination of the two (DMSO+AT) when nonedited sequences were analyzed by phred are shown. The values for the 373A with a WTR of 24 cm were omitted. Shown are the average number of bases with quality >20 (99% accuracy), the average quality of the bases, the number of errors which phred made in calling bases with a confidence level > 20 and the number of data points (N) in each group. SEM is provided for the averages.

A third type of analysis investigated the effects of adding DMSO or altering thermocycling temperature profiles on error rates. In this case, the effects of DMSO and AT were analyzed separately. The analysis utilized all unedited sequences regardless of machine type or gel length (WTR). To see the effect of addition of DMSO, the files were divided into two sets: those where 5% DMSO was added (DMSO and DMSO+AT); and those where DMSO was not added (standard conditions and AT). These results are summarized in the first three lines of Table 2. Error rates were computed for both sets in each of three base call ranges (1-200, 1-400, and 1-600). Miscalls, insertions and deletions - but not ambiguities (N's) - were considered errors. Similarly, the effect of altered thermocycling temperature profiles was investigated by dividing the files into two groups; the results are listed in the second set of three lines in Table 2. These show error rates for basecalls made with altered thermocycling temperature profiles (AT and DMSO+AT) and without altered thermocycling temperature profiles (standard and DMSO).

Table 2. Effect of reaction conditions on error rate.

            DMSO                               No DMSO
Position    Errors   Bases    Error Rate, %    Errors   Bases    Error Rate, %    P Value
1-200       110      9097     1.21             201      11030    1.82             0.0005557
1-400       221      17321    1.28             402      20476    1.96             0.0000002
1-600       435      21881    1.99             629      24976    2.52             0.0001367

            Altered Thermocycling Temp         No Altered Thermocycling Temp
Position    Errors   Bases    Error Rate, %    Errors   Bases    Error Rate, %    P Value
1-200       107      8508     1.26             204      11619    1.76             0.0055648
1-400       229      15742    1.45             394      22055    1.79             0.0140433
1-600       404      19342    2.09             660      27515    2.40             0.0288106

The effect of 5% DMSO and Altered Thermocycling temperature conditions on error rate. The number of errors is for all unedited sequences. Errors include miscalls, insertions and deletions, but not ambiguities. Data is from all types of machines.

As can be seen in Table 2, the number of errors from data in which the sequencing reactions contained DMSO was significantly lower than for those not containing DMSO, over all ranges. Altered thermocycling temperature profile conditions also had significantly fewer errors, but showed a less significant benefit over all ranges. The last column of the table addresses the significance of these lower rates. The p-value reported for each line of the table is the probability that the two error rates in that line represent two independent data sets from a single binomial distribution, that is, that the two rates differ due to chance only. In all cases, the differences between the two error rates are significant at a p<0.05 level. In the case of DMSO addition, the differences in error rates are highly significant.
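A comparison of two error rates of this kind can be sketched with a pooled two-proportion z-test. This is an assumption made for illustration: the source states only that S-Plus was used to compute the p-values and does not name the exact test, so the values produced here need not reproduce those in Table 2. The function name `two_proportion_p_value` is hypothetical.

```python
import math


def two_proportion_p_value(errors_a, bases_a, errors_b, bases_b):
    """Two-sided p-value for the hypothesis that both error rates come
    from one binomial distribution (normal approximation, pooled rate)."""
    rate_a = errors_a / bases_a
    rate_b = errors_b / bases_b
    pooled = (errors_a + errors_b) / (bases_a + bases_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / bases_a + 1 / bases_b))
    z = (rate_a - rate_b) / std_err
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Counts from the 1-200 line of Table 2: DMSO versus no DMSO
    p = two_proportion_p_value(110, 9097, 201, 11030)
    print(f"approximate p-value: {p:.6f}")
```

With the 1-200 counts for DMSO versus no DMSO, this approximation gives a p-value well below 0.05, in line with the conclusion drawn from Table 2.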

Although statistical analyses of the combined results can yield useful information, the results obtained on a laboratory to laboratory basis can also provide insight. The difference in the number of errors between the four different conditions was determined for each individual laboratory (Table 3).

Table 3. Individual variation due to conditions/editing.

Comparison                               DMSO         AT            DMSO+AT       Editing
Total comparisons (N)                    35           24            23            86
Improved                                 89%          67%           83%           88%
Average # of additional bases +/- SEM    31 +/- 6     48 +/- 20     37 +/- 9      24 +/- 3
Not Improved                             11%          33%           17%           12%
Average # of fewer bases +/- SEM         -13 +/- 2    -28 +/- 10    -30 +/- 17    -85 +/- 22

 

In every case where data was submitted from one laboratory directly comparing conditions, the difference in the number of errors from the standard or unedited condition was calculated. The percentage of labs which showed improvement and the average number of "additional" bases +/- SEM are shown. In a similar manner, the percentage of labs which showed a negative effect and the average "loss" of bases +/- SEM can be seen.

When examined in this manner, DMSO clearly benefits most labs (89%), while only 67% of the labs benefit from AT. The combination of conditions (DMSO+AT) was similar to DMSO alone (83%). If the sequence obtained using the manufacturer's recommended conditions was of poor quality, then the addition of DMSO was especially beneficial (data not shown). Additionally, among the laboratories that showed increased errors from the inclusion of DMSO or the use of an altered thermocycling temperature profile, the latter had a more detrimental effect on the sequence data.

General trends can also provide information. Table 4 shows the top sequences as determined by the fewest number of machine-generated errors, categorized by the different types of machines. Eleven of the fifteen best sequences contained DMSO, while only one used an altered thermocycling temperature without DMSO. This trend is also reflected in the number of high quality bases and the average base quality determined by phred. The very best sequence utilized neither DMSO nor altered thermocycling temperatures; however, this laboratory reported using an alternative reaction buffer. Thus, only two of the top 15 laboratories used the manufacturers' recommended conditions to obtain the most accurate sequence for this GC-rich template.

 

Table 4. Best sequence from the top labs.

                         Number of Errors
CODE       DMSO   AT     1-200   1-400   1-600    # Bases with Quality >20    Avg Base Quality    Machine    WTR

377/373S
8314.AB    -      -      0       0       0        502                         21.4                373S       48
O627       +      -      2       2       3        484                         17.9                373S       48
1277.5     +      -      0       0       4        323                         12.9                377        36
8885       +      -      3       3       4        479                         15.4                377        48
9997       +      +      0       1       6        427                         16.1                377        36
9414.5     +      +      2       2       9        416                         11.6                373S       48
9978       -      -      1       2       10       NA                          NA                  377        36
1459       +      +      2       3       10       396                         15.6                373S       48
5677.5     +      +      1       1       11       386                         13.1                377        36

LI-COR
3708       -      +      2       2       2        306                         16.0                LI-COR     31
2287       -      -      5       6       6        447                         14.0                LI-COR     56

373A
6991       +      -      0       6       57       NA                          NA                  373A       24
O175       +      +      2       6       60       321                         15.7                373A       24
5836       +      +      0       9       64       184                         9.9                 373A       24
7329       +      -      1       10      58       NA                          NA                  373A       24


Sequences were sorted by code number (lab), then by the number of errors (machine generated) contained in the 1-600 column, and finally, by the 1-400 error column. The best sequence from each lab was selected and sorted by WTR length. For the ABI 373A, which has a 24 cm WTR, the sequences were sorted by the 1-400 column and then the 1-200 column. The Code column contains the participant's 4 digit identification number. '.5' after a code indicates a "half" (10 µl) reaction. '.AB' indicates that an Alternate Buffer was used. The DMSO and AT columns indicate whether 5% DMSO or Altered Thermocycling temperature conditions, respectively, were used to produce the data. No letter indicates that standard manufacturer's conditions were used. The number of bases with a quality >20 and the average base quality as analyzed by phred are shown. 'NA' indicates that the sample data was not available for the phred analysis. The machine type and WTR length are indicated.

Effects of Manual Editing on Data Accuracy

The top panel of Figure 3 demonstrates the effect of manual editing on sequence accuracy. Over all ranges, editing reduced the number of errors. All conditions and all machines were utilized in this analysis. When examined on an individual basis, 88% of the sequences were improved by editing (Table 3). The average improvement from manual editing was 24 more correct bases over the entire range of 600 bases analyzed. Twelve percent of the sequences decreased in accuracy after editing, by an average of 85 bases. However, it must be noted that in some of these cases editing eliminated bases from the 3' end of the sequence read. In this analysis, sequence reads that did not extend through base 600 were scored as errors for the number of bases short of base 600. For example, if a sequence was perfectly matched with the known sequence through base 575, but ended at this position, it was scored as having 25 errors in the 400-600 base range. The apparent decrease in accuracy for these samples may be an artifact of this analysis method.
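The scoring rule for reads that stop short of base 600 can be expressed in a few lines. This is a minimal sketch of the rule as described above; the function name and its parameters are hypothetical.

```python
def errors_with_length_penalty(alignment_errors, read_end, target_end=600):
    """Total errors for a read, counting every base short of `target_end`
    as one additional error.

    alignment_errors: errors found within the bases actually read
    read_end:         position of the last base in the read
    """
    missing_bases = max(0, target_end - read_end)
    return alignment_errors + missing_bases


if __name__ == "__main__":
    # A read that matches perfectly but ends at base 575
    print(errors_with_length_penalty(0, 575))
    # A read that extends past base 600 incurs no penalty
    print(errors_with_length_penalty(3, 650))
```

Under this rule, an edited read trimmed at the 3' end can score worse than its longer unedited version even when every remaining base is correct.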

Figure 3. Effects of manual editing. (Top) Analysis of all sequences showing the effects of editing. The average and the SEM for the number of errors are plotted. The number of data points (n) for each group is indicated on the x-axis. The values for the 373A with a WTR of 24 cm were omitted from the 0-600 range because these machines rarely give accurate sequence in the 400-600 base range. (Bottom) Chromatogram of an unedited sequence showing the GC-rich nature of this template and the benefits to be gained by manual editing. This sample was run under standard conditions using dye-terminator chemistry on a 373S using a 48 cm (WTR) gel.

 

Because this sequence is 74% GC-rich over 450 bases, it contains a very high number of occurrences of A's preceding G's (AG errors). This base order produces a G with a very low signal strength, a known artifact with the AmpliTaq DNA polymerase, FS, and the rhodamine dye chemistry (13, 14). An example of this phenomenon is shown in the lower panel of Figure 3. Note the N's - almost all of which could be called correctly by manual editing. In the list of overall ranking, this sequence ranked #106 in its non-edited version with 30 errors in the first 600 bases, while the edited version ranked in the top 10 sequences with NO errors out to 667 bases.

A statistical analysis of the rate of occurrence of AG errors, and how it is affected by the conditions, is shown in Table 5.

 

Table 5. Effect of conditions on error rate at AG.

            DMSO                               No DMSO
Position    Errors   Bases    Error Rate, %    Errors   Bases    Error Rate, %    P Value
1-200       52       752      6.91             102      872      11.70            0.00139760
1-400       116      1123     10.33            229      1235     18.54            0.00000020
1-600       222      1363     16.29            351      1435     24.46            0.00000010

            Altered Thermocycling Temp         No Altered Thermocycling Temp
Position    Errors   Bases    Error Rate, %    Errors   Bases    Error Rate, %    P Value
1-200       49       706      6.94             105      918      11.44            0.0028714
1-400       126      1016     12.40            219      1342     16.32            0.0091465
1-600       219      1190     18.40            354      1608     22.01            0.0218458

The effect of 5% DMSO and Altered Thermocycling on error rate at the AG location. Only errors at AG are shown as a function of all AG contexts. Errors include miscalls, insertions and deletions, but not ambiguities. Data is from all types of machines.

The results here are analogous to those displayed in Table 2. For example, line 1 of Table 5 shows the error rates at all AG positions that occur within positions 1-200. Two such error rates are displayed, for basecalls made with and without DMSO. As in Table 2, this analysis is restricted to non-edited basecalls and excludes N's. Note that all error rates in Table 5 are considerably higher than those in Table 2, reflecting the general problems at AG positions. Note also that the error rates without DMSO are again consistently higher than with DMSO. As in Table 2, the p-value in each line represents the probability that the two error rates could have occurred as independent draws from a single distribution. Strikingly, the overall error rate for all conditions and all machines through 600 bases is 2.27% (Table 2), while the error rate at AG positions over 600 bases can be as high as 24.5% (Table 5). In particular, the AG error rate between bases 400 and 600 without DMSO is an astounding 61%, but decreases to 44% with the addition of DMSO (data not shown).
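The 400-600 figures quoted above are consistent with differencing the cumulative counts in the DMSO half of Table 5. The sketch below shows that arithmetic, assuming the 1-400 and 1-600 rows are cumulative; the function name `interval_error_rate` is hypothetical.

```python
def interval_error_rate(errors_to_end, bases_to_end, errors_to_start, bases_to_start):
    """Error rate (%) within an interval, derived from two cumulative rows."""
    interval_errors = errors_to_end - errors_to_start
    interval_bases = bases_to_end - bases_to_start
    return 100 * interval_errors / interval_bases


if __name__ == "__main__":
    # Cumulative AG counts from Table 5 (1-600 row minus 1-400 row)
    no_dmso = interval_error_rate(351, 1435, 229, 1235)
    dmso = interval_error_rate(222, 1363, 116, 1123)
    print(f"AG error rate, bases 400-600, no DMSO: {no_dmso:.0f}%")
    print(f"AG error rate, bases 400-600, DMSO:    {dmso:.0f}%")
```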

Machine Performance.

The number of bases with a quality >20 (as assigned by phred software) was used to determine the relative performance of the various machine types and different gel lengths (Figure 4, upper panel). As could be expected, a longer gel length (WTR) is generally correlated with an increased length of high quality sequence information. The LI-COR instruments, and the 373S-48 cm WTR, when compared to the 373A, are the only differences that are statistically significant. The relative performance is also shown in terms of the average number of errors, for the best results for each machine type and WTR (Figure 4, lower panel). In this case, the longer gel length generally correlates with fewer errors. However, the listing of the best overall sequences (Table 4) shows that highly accurate sequence can be generated on either 373S or 377 machines, running either 36 or 48 cm WTR gels.

Figure 4. Machine performance. The average number of bases with a quality >20 (phred analysis) for the best condition submitted by a laboratory for each type of machine and WTR length is indicated in the top panel. The SEM and the number of machines in each group (n) are shown. The average number of errors for the 0-600 range for the best condition submitted by a laboratory for each type of machine and WTR length is shown in the bottom panel. The SEM and the number of machines in each group (n) are shown.

 

Discussion

The 1997 ABRF DSRG study addressed three specific questions related to obtaining high quality DNA sequence information from a GC-rich template. First, does inclusion of DMSO significantly improve the quality of DNA sequence data? Second, does an altered thermocycling temperature profile improve sequence data? Finally, does manual editing of data improve sequence accuracy? Data for this survey was provided by 48 ABRF member laboratories, and was produced from a broad spectrum of automated sequencers, including Perkin-Elmer/ABI Models 373A, 373S, 377 and 310, and LI-COR Models 4000 and 4000L. This study incorporated multiple types of data analysis, both basecalling accuracy (using the manufacturers' basecalling software or phred basecalling software) and statistical analyses of submitted data, to maximize the usefulness of the information for the broadest possible audience. Core laboratories traditionally think in terms of sequence accuracy. Large-scale genome sequencers find base quality most useful for assembling large contigs. Software developers find error rates most useful for designing better basecalling algorithms.

The benefit of DMSO addition seems clear. The data in Figure 2 and Tables 1, 2, 3, 4, and 5 demonstrate that the addition of DMSO to the sequencing reaction reduces the number of errors and increases the number of high quality basecalls in the majority of laboratories. The benefits of DMSO have been previously reported (15). Presumably, the inclusion of DMSO in the reaction does not chemically change the DNA or specifically affect the DNA polymerase; rather, it affects hydrogen bond formation between the bases, thereby eliminating or minimizing template secondary structures and allowing the DNA polymerase to replicate the template more easily.

The use of altered thermocycling temperatures produced the most variable results; in some cases it produced superior sequence, and in other instances produced inferior results to the standard sequencing protocol. This variation produced equivocal results when analyzed as a group by the number of errors. Further analysis by base quality seemed to indicate that altered thermocycling temperature conditions might provide some benefit. The independent analysis by an outside consulting firm also indicated some benefit. When compared on an individual laboratory basis, it was found that altered thermocycling temperature conditions yielded improved sequence in 67% of the sequences submitted. The labs that did not benefit from altered thermocycling temperature conditions increased the average number of errors enough to mask the benefits. This variability may be explained by the variety of modified cycle sequencing temperature parameters used by different lab groups, each performing with varying success on this template. Alternatively, the sequence data produced using altered thermocycling temperature conditions resembles that from standard reaction conditions. The different thermocycling temperature profiles used by the participating labs may demonstrate that the reaction is relatively insensitive to a variety of cycling conditions - i.e., tolerant of both user and/or instrument variations.

We observed that manual data editing significantly improved data accuracy. This was true for 88% of the submissions. The actual number and extent of this benefit may be higher. Some of the edited sequences were also trimmed at the 3' end, resulting in a sequence less than 600 bases, but with higher accuracy over the remaining sequence. Since missing bases were counted as errors in this simple analysis, this increased the number of errors in the edited sequence, as explained in the Results section. It may also be that the impact of manual editing on the accuracy of a sequence is exaggerated in this study due to the high incidence of AG errors in the test template. This base combination, when sequenced using rhodamine dye-terminator chemistry, produces a weak G that often results in an ambiguous basecall. The effectiveness of manual editing is dependent on the skill and knowledge of the individual working with the sequence. However, with limited experience, a researcher can easily identify and correct the ambiguity or miscalled base. It is important to note that since manual editing of the called bases significantly improves data accuracy, the major commercially available basecalling algorithm is not optimal for this chemistry. In spite of the phenomenon of AG errors, 96% of the participating laboratories chose to use dye-terminator chemistry on this GC-rich template and only 12% reported that they use dye-primer chemistry at all. Improvements to DNA sequencing chemistry in the past year (including dRhodamine and BigDyes chemistries) have decreased the impact of the AG artifact.

In conclusion, the addition of DMSO to a sequencing reaction shows the greatest benefit when sequencing a GC-rich template. Altering thermocycling temperature conditions can be helpful, but the exact conditions may be laboratory and/or thermocycler dependent. Manual editing provides more correct sequence on this GC-rich template. Overall performance appears to be quite good, especially considering the difficulty of the template and the diversity of laboratories. It is difficult to make comparisons with previous studies because the field of DNA sequencing is changing so rapidly, with new machines and chemistries being introduced frequently. It is hoped that these results will be useful to DNA sequencing laboratories in evaluating their performance.

 

Acknowledgments

We would like to thank Christine Bogle and Fernando Vileria of the Dana-Farber Cancer Institute for sample distribution, and Dr. Theodore W. Thannhauser of Cornell University and Dr. Anthony T. Yeung of Fox Chase Cancer Center for manuscript review. Most especially, we thank all study participants for providing data for these analyses. Portions of this work were supported by a Phase II SBIR grant from the NSF (contract DMI-9612376 to Wagner Associates).

 

References

1. Adams PS, Dolejsi MK, Hardin S, Mische S, Nanthakamur B, Riethman H, Rush J, Morrison P. DNA sequencing of a moderately difficult template: Evaluation of the results from a Thermus thermophilus unknown test sample. BioTechniques. 1996;21:678.

2. Adams PS, Dolejsi MK, Hardin S, Mische S, Nanthakamur B, Riethman H, Rush J, Morrison P. DNA sequencing of a moderately difficult template: Evaluation of the results from a Thermus thermophilus unknown test sample and general survey. Available from URL: http://mbcf.dfci.harvard.edu/abrfdnaseq.

3. Naeve CW, Buck GA, Niece RL, Pon RT, Robertson M, Smith AJ. Accuracy of automated DNA sequencing: A multi-laboratory comparison of sequencing results. BioTechniques. 1995;19:448-453.

4. Ewing B, Hillier L, Wendl MC, Green P. Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research. 1998;8(3):175-185.

5. Ewing B, Green P. Basecalling of automated sequencer traces using phred. II. Error probabilities. Genome Research. 1998;8(3):186-194.

6. Genome Software Development at the University of Washington. Available from URL: http://bozeman.genome.washington.edu.

7. Collaboration Initiated For Mathematical Analysis Of DNA Sequencing Studies. ABRF News. 1997;8(3):6. Available from URL: http://www.abrf.org/ABRFNews/1997/September1997/sep97math.html.

8. Laufer E, Dahn R, Orozco OE, Yeo CY, Pisenti J, Henrique D, Abbott UK, Fallon JF, Tabin C. Expression of radical fringe in limb-bud ectoderm regulates apical ectodermal ridge formation. Nature. 1997;386: 366-402.

9. Myers EW. An Overview Of Sequence Comparison Algorithms In Molecular Biology. Technical Report. Tucson (AZ): Department of Computer Science, University of Arizona; 1991. Report No.: TR 91-29.

10. Adams PS, Dolejsi MK, Hardin S, McMinimy DL, Rush J, Morrison P. 2nd Annual ABRF DNA Sequence Research Committee Study: Effects of DMSO, Thermocycling and Editing on a Template with a 72% GC Rich Area. (Abstract) ABRF '97: Association of Biomolecular Resource Facilities Conference on Techniques at the Genome-Proteome Interface. 1997.

11. Adams PS, Dolejsi MK, Hardin S, McMinimy DL, Rush J, Morrison P. 2nd Annual ABRF DNA Sequence Research Committee Study: Effects of DMSO, Thermocycling and Editing on a Template with a 72% GC Rich Area. ABRF DNA Sequencing Research Committee 1997 Web Poster. Available from URL: http://www.abrf.org/ABRF/ResearchCommittees/dsrcreports/dsrc97.html.

12. Adams PS, Dolejsi MK, Hardin S, McMinimy DL, Rush J, Morrison P. Effects of DMSO, Thermocycling and Editing on a Template with a 72% GC Rich Area: Results from the 2nd Annual ABRF Sequencing Survey Demonstrate that Editing is the Major Factor for Improving Sequencing Accuracy. (Abstract) Ninth International Genome Sequencing and Analysis Conference. Microbial and Comparative Genomics. 1997;2(3):198.

13. Dye-terminator Peak Patterns with AmpliTaq DNA Polymerase, FS. PE Applied Biosystems User Bulletin 904357. 1996.

14. Parker LT, Deng Q, Zakeri H, Carlson C, Nickerson DA, Kwok PY. Peak height variations in automated sequencing of PCR products using Taq dye-terminator chemistry. BioTechniques. 1995;19:116-121.

15. Burgett SG, Rosteck Jr. PR. Use of Dimethyl Sulfoxide to Improve Fluorescent, Taq Cycle Sequencing. In: Adams MD, Fields C, Venter JC, editors. Automated DNA Sequencing and Analysis. New York: Academic Press; 1994. p. 211-215.

 

