Summary of the ABRF Workshop at the 1995 ASBMB Meeting:
Current Techniques in DNA Chemistry


by Clayton Naeve (St. Jude Children's Research Hospital) and Ronald L. Niece (University of Wisconsin Biotechnology Center)


The ABRF and the Educational Affairs Committee of the ASBMB organized this workshop on DNA chemistry at the ASBMB Meeting held in San Francisco, CA on May 21-25, 1995. Over 300 attendees heard presentations on large-scale genomic sequencing, software and database tools available for sequence analysis, quantitative PCR, and a review of the ABRF DNA sequencing study. The workshop is summarized here.

Large-scale genomic DNA sequencing

Dr. Richard K. Wilson (C. elegans Genome Consortium Genome Sequencing Center at Washington University, St. Louis) presented an update on the C. elegans genome sequencing project. The C. elegans worm is becoming extremely well characterized. Researchers have assembled a "parts list" of the organism in that they know all cells and their lineages, a wiring diagram of the nervous system has been determined, and the genetics of the organism are understood well enough to assist developmental biology research. Physical mapping of the genome is essentially complete.

The genomic sequence currently consists of 13 contigs spread over 5 autosomes and a single contig on the X chromosome. Autosomes display regions of densely mapping genes. Genetically dense regions have been identified using YAC grids with most bacterial clones coming from these regions. The Genome Consortium generates random M13 or phagemid subclones from cosmids for an initial "shotgun" sequencing phase followed by a directed or "walking" phase and a "finishing" phase for completion. The group has employed highly automated methods including plaque picking (400 plaques/hr), DNA template preparation (800 templates/hr), DNA sequencing reactions (4500 reactions/hr) as well as use of ABI automated DNA sequencers (60 lanes/gel). Using 10 automated sequencers running two gels daily on weekdays and one on weekends with 60 lanes per gel, they can obtain sequence from 7200 templates per week. Usually 5000 of these provide useful data, and the remainder consist of vector sequences. After 600-700 shotgun runs have been completed, finishing steps include joining sequences into contigs, closing gaps with long runs, sequencing second strands, resolving compressions, and final editing to resolve conflicts. The finishing rate is now 25 cosmids per month or 0.76 Mb per month in St. Louis with a similar rate of production in Cambridge. Analyses on finished sequences include BLAST searches to find protein similarities and GeneFinder runs to predict gene boundaries. Sequences are then submitted to appropriate databases.
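The weekly capacity quoted above follows directly from the gel schedule. The short Python sketch below reproduces that arithmetic; the variable and function names are illustrative only, and the "useful fraction" is simply the 5000 usable templates divided by the weekly total.

```python
# Weekly template capacity, using only the figures quoted in the talk.
SEQUENCERS = 10
GELS_PER_WEEKDAY = 2
GELS_PER_WEEKEND_DAY = 1
LANES_PER_GEL = 60


def weekly_templates(sequencers=SEQUENCERS, lanes_per_gel=LANES_PER_GEL):
    """Templates run per week: sequencers x gels per sequencer x lanes per gel."""
    gels_per_sequencer = 5 * GELS_PER_WEEKDAY + 2 * GELS_PER_WEEKEND_DAY
    return sequencers * gels_per_sequencer * lanes_per_gel


total = weekly_templates()
useful_fraction = 5000 / total  # "usually 5000 of these provide useful data"
print(total, round(useful_fraction, 2))  # prints: 7200 0.69
```

Each sequencer runs 12 gels per week, so roughly 70% of the lanes yield usable insert sequence.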

As of May 8, the consortium had completed 16.5 Mb of the 100 Mb C. elegans genome with an accuracy of 99.99% at 6-fold redundancy. Most of chromosomes II and III have been sequenced, and the same strategy is now being applied to the X chromosome. In the sequenced portions of the genome, a gene is observed every 5.1 kb on average, 42% of predicted genes are similar to other genes in databases, and 30% of the DNA codes for genes. The total gene count is now estimated at 12,954, compared to the 5,000 genes anticipated earlier in the project. Some genomic features found so far include: i, several genes are located within introns of other genes; ii, tRNA genes have been observed within introns; iii, tRNA genes have been found in both orientations in the same intron; iv, the longest known C. elegans gene sequenced to date is 45 kb (the average length is 5 kb); v, genes with similar sequence have been observed in head-to-tail gene clusters suggesting the equivalent of operons; vi, homeobox clusters have been found; vii, several gene families have been identified; viii, families of repeats exist (some fall into patterns, and some are specific to a particular chromosome); and ix, two-thirds of the predicted genes have been observed.

The project is generating 1.3 Mb of finished sequence per month. At the projected rate of production, the group expects to complete all six chromosomes (100 Mb) by the end of 1998, on schedule.
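As a rough check on how the current output relates to the 1998 target, the Python sketch below compares the time needed at a constant 1.3 Mb per month with the average rate the schedule would require. The 43-month window (mid-May 1995 to the end of December 1998) and the constant-rate assumption are ours, not the speaker's; the "projected rate" presumably anticipates further increases in throughput.

```python
# Figures from the talk
finished_mb = 16.5      # completed as of May 8, 1995
genome_mb = 100.0       # estimated genome size
current_rate = 1.3      # Mb of finished sequence per month

# Assumption (not from the talk): about 43 months remain until the end of 1998
months_available = 43

remaining_mb = genome_mb - finished_mb
months_at_current_rate = remaining_mb / current_rate
required_rate = remaining_mb / months_available

print(f"remaining: {remaining_mb:.1f} Mb")
print(f"months needed at the current rate: {months_at_current_rate:.1f}")
print(f"average rate needed to finish in the window: {required_rate:.2f} Mb/month")
```

Under these assumptions the average monthly output over the remaining period would need to be about one and a half times the current figure.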

Dr. Wilson completed his presentation with a brief discussion of the Merck/Washington University Expressed Sequence Tag (EST) project involving sequence determination of the 5' and 3' ends of cDNAs. He indicated that 200,000 cDNA clones are available and that one-third of these have been sequenced. Single-pass data are posted on the World Wide Web at the rate of 6000 sequences per week, including the primary 4-color fluorescence data. The EST data are also deposited into the dbEST database, which now contains over 100,000 partial cDNA sequences.

How are we going to retrieve and analyze all the data we are gathering? Dr. Mark Boguski (National Center for Biotechnology Information, NCBI, Bethesda) addressed the symposium with a presentation entitled "How to make discoveries in molecular sequence databases". Dr. Boguski outlined the tremendous rate of growth in the biomedical literature (7 x 10^6 articles in Medline, 700,000 genetics articles alone) as well as molecular databases (GenBank doubles every 20 months), and he described NCBI's intention to develop suitable software and database tools to access these data-rich resources. The NCBI maintains GenBank, which is now able to accept sequence submissions via a Web server submission page (BankIt), but NCBI still has to scan some sequences from journals. NCBI provides access to GenBank via anonymous ftp, Entrez server, CD-ROM, electronic mail, and the Internet. Search queries now total 15,000-20,000 per day.
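To give a feel for what a 20-month doubling time means, the Python sketch below evaluates the corresponding exponential growth factor. It assumes the doubling time stays constant, which is an extrapolation and not something claimed in the talk; the function name is illustrative.

```python
def growth_factor(months, doubling_time_months=20.0):
    """Factor by which a database grows after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


print(round(growth_factor(12), 2))  # growth over one year: 1.52
print(growth_factor(60))            # growth over five years: 8.0
```

At this pace a database grows by about half each year and eightfold in five years.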

To interpret new sequences by homology and to integrate new data against the existing "information space", NCBI has developed Entrez, an integrated information retrieval system. Entrez integrates all major nucleotide databases (GenBank, EMBL, DDBJ, dbEST, dbSTS, patents), protein databases (GenBank, SwissProt, PIR, PRF, PDB), and the Medline literature database using sophisticated homology searches and sequence neighboring. Entrez has recently been updated to include a tree-structured taxonomy database as well as a structural database for molecular modeling. The next release, due in six months, is expected to allow sequence retrieval from a genetic map, which will provide a tremendous resource for positional cloners, sequencers, and other molecular biologists.

EST sequence accrual is now occurring more rapidly than genomic data and is being used extensively: BLAST searches have doubled in the last quarter, electronic mail searches have increased 4-fold, WWW requests 5-fold, and anonymous ftp retrievals 10-fold!

A major problem facing NCBI and database users is the level of redundancy in the GenBank and EST databases. A single gene may be represented many times (genomic with introns, genomic without introns, mutants, full-length mRNA, spliced mRNA, partial mRNA, etc.). Efforts are underway to collapse these into a smaller set to allow rapid retrieval of unique sequence entries.
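The simplest form of this redundancy, exact duplicates and partial sequences wholly contained in a longer entry, can be illustrated with a toy Python sketch. This is not NCBI's method, which the talk did not describe; the entry names and sequences are invented for illustration, and real cases such as introns, mutants, and alternative splicing would need alignment-based clustering.

```python
def collapse_redundant(entries):
    """Keep only entries whose sequence is not identical to, or contained in, a longer kept entry.

    `entries` maps an entry name to its sequence. Longer sequences are considered
    first; ties are broken by name so the result is deterministic.
    """
    kept = {}
    ordered = sorted(entries.items(), key=lambda item: (-len(item[1]), item[0]))
    for name, seq in ordered:
        seq = seq.upper()
        if not any(seq in longer for longer in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical entries: one full-length mRNA, a partial copy, a duplicate, and another gene
entries = {
    "M0001_full": "ATGGCCAAGTTGACC",
    "M0002_partial": "GCCAAGTT",
    "M0003_dup": "atggccaagttgacc",
    "M0004_other": "TTTCCCGGGAAA",
}
unique = collapse_redundant(entries)
print(sorted(unique))  # ['M0001_full', 'M0004_other']
```

Four entries collapse to two, which is the kind of reduction that makes retrieval of unique sequences faster.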

Quantitative PCR

Dr. Francois Ferre (Immune Response Corporation, Carlsbad, CA) spoke on the subject of quantitative PCR (Q-PCR). This technology is routinely applied in research and clinical settings to assess gene expression, to estimate virus load, and to monitor therapy. PCR offers tremendous advantages in DNA sequence detection compared to other methods. While direct methods such as Southern blots, Northern blots, slot blots, etc. require 10^5 to 10^7 target molecules and indirect methods such as the Q-beta replicase/branched DNA method require 10^4 target molecules, only PCR allows detection of 1 target molecule in a high background.

PCR can be geared for quantitation. Methods can be either direct (e.g., blotting, RNAse protection) or indirect (e.g., Q-beta replication or branched DNA sequence amplification). For DNA, quantitation of the target requires a standard curve if it is dependent on amplification efficiency. Limiting dilution is used to make the quantitation independent of amplification efficiency. For RNA, standards generally are not available for the preparation of standard curves. Co-amplification using a reporter gene is more precise than use of a synthetic standard.
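One common way to analyze a limiting-dilution experiment is with Poisson statistics: if targets are distributed randomly among replicate reactions, the fraction of reactions with no product is exp(-m), where m is the mean number of targets per reaction. The Python sketch below applies that relation. The summary does not say how the speaker analyzed such data, so this is an illustration of the principle and not a description of his method; the function name is hypothetical.

```python
import math


def mean_targets_per_reaction(negative_reactions, total_reactions):
    """Poisson estimate of mean target copies per reaction from the negative fraction.

    Only presence or absence of product is scored, so the estimate does not
    depend on how efficiently each positive reaction amplified.
    """
    if total_reactions <= 0 or not 0 < negative_reactions <= total_reactions:
        raise ValueError("need at least one negative reaction out of a positive total")
    return -math.log(negative_reactions / total_reactions)


# Example: 37 of 100 replicate reactions at a given dilution show no product
estimate = mean_targets_per_reaction(37, 100)
print(round(estimate, 2))  # 0.99, i.e. about one target per reaction
```

Multiplying the estimate by the dilution factor gives the concentration in the undiluted sample.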

All major steps in Q-PCR must be optimized. These include nucleic acid extraction, characterization of standards, amplification conditions, and detection and quantitation of amplified products.

  1. Optimize nucleic acid extraction. DNA is usually obtained by extraction and quantitated by OD measurements, which should be verified by making several measurements. Alternatively, cell lysis followed by use of an internal PCR control such as globin or actin can be used. RNA is also prepared by extraction, usually using guanidinium-HCl, and quantitated by OD. Automated nucleic acid extractors are also commonly used.
  2. Characterize the controls or standards. Data were presented on the proper use of internal controls to increase overall precision. Normalization to the internal standard considerably reduced scatter in the data, facilitating interpretation of HIV copy numbers during the course of therapeutic treatment.
  3. Optimize the efficiency of amplification. To obtain best results one must minimize the number of PCR cycles used; each cycle is not 100% efficient, so overall efficiency decreases with each additional cycle. However, with fewer cycles a more sensitive detection system is needed, and fortunately there have been several recent advances in detection systems.
  4. Optimize the detection system. The product of PCR amplification of the target can be quantitated following gel electrophoresis using radiolabels, either indirectly from blots or cut-out bands or directly using scanners. gamma-32P is preferred over alpha-32P, because alpha-32P gives higher backgrounds, and direct quantitation is preferred over indirect methods. Products can be quantitated directly by hybridization protection assays or probing during PCR or using enzyme-linked affinity assays with solid phase capture.
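The point in step 3 about cycle number can be illustrated with the standard amplification model N = N0 x (1 + E)^n, where E is the per-cycle efficiency (1.0 means perfect doubling). The Python sketch below uses that model with a constant E, which is a simplification we assume for illustration; the efficiency values are hypothetical and not taken from the talk.

```python
def amplified_copies(initial_copies, cycles, efficiency):
    """Product copies after `cycles`, with a constant per-cycle efficiency (0 to 1)."""
    return initial_copies * (1.0 + efficiency) ** cycles


# Shortfall of a 90%-efficient reaction relative to perfect doubling
for cycles in (10, 20, 30):
    ideal = amplified_copies(100, cycles, 1.0)
    actual = amplified_copies(100, cycles, 0.9)
    print(cycles, round(actual / ideal, 2))
```

The yield relative to ideal doubling falls from about 60% at 10 cycles to about 21% at 30 cycles, which is why small differences in efficiency matter more as cycles are added.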

Several different Q-PCR strategies were discussed, e.g., competitive versus non-competitive RNA quantitation and commercial systems such as TaqMan (PE/ABD) and the Roche format ELISA-type biotin-labeled detection system.

The feasibility and practicality of performing relative versus absolute PCR quantitation were discussed. Absolute quantitation is dependent on efficiency, and there are three keys to accurate absolute quantitation: the internal standard must be close in composition to the target being tested; the amount of internal standard must be accurate (use several methods); and PCR must be validated by other methods. Dr. Ferre encouraged proper validation of Q-PCR for research and clinical applications. Use alternative methods and understand the limits of the assay in terms of accuracy.
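The reason the internal standard must resemble the target can be seen in a simple ratio calculation. The Python sketch below estimates target copies from the product ratio of co-amplified target and standard, and shows how the estimate is biased when the two amplify with different efficiencies. The model, function names, and numbers are assumptions for illustration; the talk did not present a specific formula.

```python
def target_copies(target_signal, standard_signal, standard_copies):
    """Estimate starting target copies, assuming target and standard amplify equally well."""
    if standard_signal <= 0:
        raise ValueError("standard signal must be positive")
    return standard_copies * target_signal / standard_signal


def efficiency_bias(target_efficiency, standard_efficiency, cycles):
    """Factor by which the estimate is off when per-cycle efficiencies differ."""
    return ((1.0 + target_efficiency) / (1.0 + standard_efficiency)) ** cycles


# Hypothetical run: 1000 copies of standard added, target band three times as strong
print(target_copies(4500, 1500, 1000))  # 3000.0

# Hypothetical mismatch: target at 90% efficiency, standard at 95%, 30 cycles
bias = efficiency_bias(0.90, 0.95, 30)
print(round(bias, 2))
```

With matched efficiencies the bias factor is exactly 1; with the mismatch shown, the target would be underestimated by more than half.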

Several examples of the application of Q-PCR in monitoring the course of anti-HIV therapies were given illustrating the sensitivity and effectiveness of the assay.

Accuracy of automated DNA sequencing in the core facility

Clayton W. Naeve (St. Jude Children's Research Hospital, Memphis) presented the ABRF Nucleic Acids Committee's DNA sequencing study results. The study (previously summarized in ABRF News, December 1994, and submitted for publication) demonstrated that, in the core facility setting, the dye-primer protocol provided the longest and most accurate reads (400-450 bases on average with post-run editing, 300-350 bases without post-run editing). However, the first 100 bases appeared less reliable. The dye-terminator protocol, while used by 75% of participating facilities, provided shorter reads (275-300 bases on average) and surprisingly these reads were not significantly improved with post-run editing. The first 100 bases appeared more reliable using the dye-terminator protocol.

The range of core facility performance also varied by protocol. For the dye-primer protocol, the "best" data set submitted for the study gave 600 bases with greater than 96% accuracy, and the "poorest" provided 300 bases at greater than 98% accuracy. For the dye-terminator protocol, the "best" data set gave 500 bases with greater than 99% accuracy and the "poorest" 300 bases with greater than 99% accuracy.

The distribution of errors was also examined. The errors produced using the dye-primer protocol were largely miscalls and no-calls (N's). Post-run editing corrected both types and extended the length of read. The errors produced using the dye-terminator protocol also consisted largely of miscalls and no-calls. However, while editing corrected most no-calls and fewer miscalls, deletions between bases 300-400 were not corrected and the length of read was not extended.
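The two error classes discussed here, miscalls and no-calls, can be tallied with a simple position-by-position comparison against the known template. The Python sketch below is a minimal illustration and not the committee's scoring procedure; it assumes the read is already aligned to the reference with no insertions or deletions, so the deletions mentioned above would require a proper sequence alignment first. The sequences are invented.

```python
def score_read(read, reference):
    """Count no-calls (N) and miscalls over the overlapping length of read and reference."""
    length = min(len(read), len(reference))
    pairs = list(zip(read[:length].upper(), reference[:length].upper()))
    no_calls = sum(1 for called, _ in pairs if called == "N")
    miscalls = sum(1 for called, true in pairs if called != "N" and called != true)
    accuracy = (length - no_calls - miscalls) / length if length else 0.0
    return {"length": length, "miscalls": miscalls, "no_calls": no_calls, "accuracy": accuracy}


# Hypothetical 10-base example: one no-call (position 5) and one miscall (position 9)
result = score_read("ACGTNCGTTC", "ACGTACGTAC")
print(result)  # {'length': 10, 'miscalls': 1, 'no_calls': 1, 'accuracy': 0.8}
```

Running the same tally over successive 100-base windows would show where in a read the errors concentrate.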

There were no sequence motifs contributing to poor base calling in the test template; however, the typical factors contributing to errors were observed: low signal strength, loss of resolution, and weak C's following G's.




Created: 27th July 1995
Last modified: 27th July 1995