The genomic sequence currently consists of 13 contigs spread over 5 autosomes and a single contig on the X chromosome. Autosomes display regions of densely mapping genes. Genetically dense regions have been identified using YAC grids with most bacterial clones coming from these regions. The Genome Consortium generates random M13 or phagemid subclones from cosmids for an initial "shotgun" sequencing phase followed by a directed or "walking" phase and a "finishing" phase for completion. The group has employed highly automated methods including plaque picking (400 plaques/hr), DNA template preparation (800 templates/hr), DNA sequencing reactions (4500 reactions/hr) as well as use of ABI automated DNA sequencers (60 lanes/gel). Using 10 automated sequencers running two gels daily on weekdays and one on weekends with 60 lanes per gel, they can obtain sequence from 7200 templates per week. Usually 5000 of these provide useful data, and the remainder consist of vector sequences. After 600-700 shotgun runs have been completed, finishing steps include joining sequences into contigs, closing gaps with long runs, sequencing second strands, resolving compressions, and final editing to resolve conflicts. The finishing rate is now 25 cosmids per month or 0.76 Mb per month in St. Louis with a similar rate of production in Cambridge. Analyses on finished sequences include BLAST searches to find protein similarities and GeneFinder runs to predict gene boundaries. Sequences are then submitted to appropriate databases.
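The weekly throughput figure quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the report; the function and parameter names are illustrative, not part of any consortium software.

```python
def weekly_templates(sequencers=10, weekday_gels=2, weekend_gels=1, lanes_per_gel=60):
    """Templates sequenced per week, one template per lane."""
    # Five weekdays at two gels each plus two weekend days at one gel each.
    gels_per_sequencer = weekday_gels * 5 + weekend_gels * 2
    return sequencers * gels_per_sequencer * lanes_per_gel


total = weekly_templates()
useful = 5000  # templates that usually give useful data, per the report

print(total)                     # 7200
print(round(useful / total, 3))  # 0.694
```

On these figures, roughly 69% of the templates run each week yield useful data.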
As of May 8, the consortium had completed 16.5 Mb of the 100 Mb C. elegans genome with an accuracy of 99.99% at 6-fold redundancy. Most of chromosomes II and III have been sequenced, and the same strategy is now being applied to the X chromosome. In the sequenced portions of the genome, a gene is observed every 5.1 kb on average, 42% of predicted genes are similar to other genes in databases, and 30% of the DNA codes for genes. The total gene count is now estimated at 12,954, compared to the 5,000 genes anticipated earlier in the project. Some genomic features found so far include: (i) several genes are located within introns of other genes; (ii) tRNA genes have been observed within introns; (iii) tRNA genes have been found in both orientations in the same intron; (iv) the longest known C. elegans gene sequenced to date is 45 kb (the average length is 5 kb); (v) genes with similar sequence have been observed in head-to-tail gene clusters, suggesting the equivalent of operons; (vi) homeobox clusters have been found; (vii) several gene families have been identified; (viii) families of repeats exist, some of which fall into patterns and some of which are specific to a particular chromosome; and (ix) two-thirds of the predicted genes have been observed.
The project is generating 1.3 Mb of finished sequence per month. At the projected rate of production, the group expects to complete all six chromosomes (100 Mb) by the end of 1998, on schedule.
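The relationship between remaining sequence, production rate, and time to completion can be sketched as follows. The report does not state the projected production rate or the number of months remaining, so the `months_available` value in the example is purely hypothetical; only the 100 Mb, 16.5 Mb, and 1.3 Mb per month figures come from the text.

```python
def months_to_finish(genome_mb, finished_mb, rate_mb_per_month):
    """Months needed to finish the remaining sequence at a constant rate."""
    if rate_mb_per_month <= 0:
        raise ValueError("rate must be positive")
    return (genome_mb - finished_mb) / rate_mb_per_month


def rate_needed(genome_mb, finished_mb, months_available):
    """Average rate (Mb/month) needed to finish within the time available."""
    if months_available <= 0:
        raise ValueError("months_available must be positive")
    return (genome_mb - finished_mb) / months_available


# Remaining 83.5 Mb at the current 1.3 Mb/month.
print(round(months_to_finish(100, 16.5, 1.3), 1))  # 64.2

# Hypothetical example: average rate needed if 36 months were available.
print(round(rate_needed(100, 16.5, 36), 2))        # 2.32
```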
Dr. Wilson completed his presentation with a brief discussion of the Merck/Washington University Expressed Sequence Tag (EST) project involving sequence determination of the 5' and 3' ends of cDNAs. He indicated that 200,000 cDNA clones are available and that one-third of these have been sequenced. Single-pass data are posted on the World Wide Web at the rate of 6000 sequences per week, including the primary 4-color fluorescence data. The EST data are also deposited into the dbEST database, which now contains over 100,000 partial cDNA sequences.
How are we going to retrieve and analyze all the data we are gathering? Dr. Mark Boguski (National Center for Biotechnology Information, NCBI, Bethesda) addressed the symposium with a presentation entitled "How to make discoveries in molecular sequence databases". Dr. Boguski outlined the tremendous rate of growth in the biomedical literature (7 × 10^6 articles in Medline, 700,000 genetics articles alone) as well as molecular databases (GenBank doubles every 20 months), and he described NCBI's intention to develop suitable software and database tools to access these data-rich resources. The NCBI maintains GenBank, which is now able to accept sequence submissions via a Web server submission page (BankIt), but NCBI still has to scan some sequences from journals. NCBI provides access to GenBank via anonymous ftp, Entrez server, CD-ROM, electronic mail, and the Internet. Search queries now total 15,000-20,000 per day.
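To give a feel for what a 20-month doubling time means, the sketch below converts it into a growth factor over an arbitrary period. It assumes steady exponential growth, which is a simplification; the function name is illustrative.

```python
def growth_factor(months, doubling_time_months=20.0):
    """Factor by which a database grows over `months`, assuming steady doubling."""
    return 2.0 ** (months / doubling_time_months)


print(round(growth_factor(12), 2))  # 1.52, i.e. about 52% growth per year
print(growth_factor(60))            # 8.0, i.e. eightfold growth in five years
```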
To interpret new sequences by homology and to integrate new data against the existing "information space", NCBI has developed Entrez, an integrated information retrieval system. Entrez integrates all major nucleotide databases (GenBank, EMBL, DDBJ, dbEST, dbSTS, patents), protein databases (GenBank, SwissProt, PIR, PRF, PDB), and the Medline literature database using sophisticated homology searches and sequence neighboring. Entrez has recently been updated to include a tree-structured taxonomy database as well as a structural database for molecular modeling. The next release, due in six months, is expected to allow sequence retrieval from a genetic map, which will provide a tremendous resource for positional cloners, sequencers, and other molecular biologists.
EST sequence accrual is now occurring more rapidly than genomic data and is being used extensively: BLAST searches have doubled in the last quarter, electronic mail searches have increased 4-fold, WWW requests 5-fold, and anonymous ftp retrievals 10-fold!
A major problem facing NCBI and database users is the level of redundancy in the GenBank and EST databases. A single gene may be represented many times (genomic with introns, genomic without introns, mutants, full-length mRNA, spliced mRNA, partial mRNA, etc.). Efforts are underway to collapse these into a smaller set to allow rapid retrieval of unique sequence entries.
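As a rough illustration of what collapsing redundant entries involves, the sketch below drops any sequence that is an exact copy of, or wholly contained in, a longer entry already kept. This is not NCBI's method, which the report does not describe. Exact containment would also miss most of the cases listed above (for example genomic versus spliced sequences, or mutants), which need alignment-based comparison. The entry names and sequences are made up.

```python
def collapse_redundant(records):
    """Keep only entries not contained in a longer (or identical) kept entry.

    `records` maps an entry name to its sequence string.
    """
    kept = {}
    # Visit longest sequences first so shorter fragments are tested against them.
    for name, seq in sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if not any(seq in longer for longer in kept.values()):
            kept[name] = seq
    return kept


records = {
    "entry_a": "ATGGCCATTGTAATGGGCCGC",       # full-length sequence
    "entry_a_copy": "ATGGCCATTGTAATGGGCCGC",  # verbatim duplicate
    "entry_a_partial": "GCCATTGTAATG",        # fragment of entry_a
    "other_gene": "TTTTCCCCAAAAGGGG",         # unrelated sequence
}

print(sorted(collapse_redundant(records)))  # ['entry_a', 'other_gene']
```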
PCR can be geared for quantitation. Methods can be either direct (e.g., blotting, RNase protection) or indirect (e.g., Qβ replication or branched DNA sequence amplification). For DNA, quantitation of the target requires a standard curve if it is dependent on amplification efficiency. Limiting dilution is used to make the quantitation independent of amplification efficiency. For RNA, standards generally are not available for the preparation of standard curves. Co-amplification using a reporter gene is more precise than use of a synthetic standard.
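Two short calculations illustrate why amplification efficiency matters and why limiting dilution avoids it. The first uses the textbook model in which product grows as (1 + efficiency) per cycle; the second uses the standard Poisson single-hit estimate, in which the fraction of negative reactions equals exp(-mean copies per reaction). Neither formula is taken from the presentation itself, and the example numbers are hypothetical.

```python
import math


def amplified_copies(start_copies, efficiency, cycles):
    """Product after `cycles`, assuming constant per-cycle efficiency (0 to 1)."""
    return start_copies * (1.0 + efficiency) ** cycles


def mean_copies_per_reaction(negative_reactions, total_reactions):
    """Poisson estimate of mean target copies per reaction at limiting dilution."""
    if not 0 < negative_reactions <= total_reactions:
        raise ValueError("need at least one negative reaction and a valid total")
    return -math.log(negative_reactions / total_reactions)


# A drop in efficiency from 100% to 90% changes the yield several-fold after 30 cycles.
ratio = amplified_copies(100, 1.0, 30) / amplified_copies(100, 0.9, 30)
print(round(ratio, 1))  # 4.7

# Limiting dilution scores reactions only as positive or negative,
# so the estimate does not depend on how efficiently each one amplified.
print(round(mean_copies_per_reaction(5, 10), 3))  # 0.693
```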
All major steps in Q-PCR must be optimized. These include nucleic acid extraction, characterization of standards, amplification conditions, and detection and quantitation of amplified products.
The feasibility and practicality of performing relative versus absolute PCR quantitation were discussed. Absolute quantitation is dependent on efficiency, and there are three keys to accurate absolute quantitation: the internal standard must be close in composition to the target being tested; the amount of internal standard must be accurately known (determined by several methods); and the PCR must be validated by other methods. Dr. Ferre encouraged proper validation of Q-PCR for research and clinical applications: use alternative methods and understand the limits of the assay in terms of accuracy.
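The requirement that the internal standard resemble the target can be illustrated with a simple co-amplification model. In this sketch the target amount is estimated from the ratio of target to standard signal, which is only valid if both amplify with the same efficiency; the second function shows how far the estimate drifts when they do not. The model, function names, and numbers are assumptions for illustration, not values from the talk.

```python
def estimate_target_copies(standard_copies, target_signal, standard_signal):
    """Estimate starting target copies from a co-amplified internal standard.

    Assumes target and standard amplify with identical efficiency.
    """
    return standard_copies * target_signal / standard_signal


def efficiency_bias(target_efficiency, standard_efficiency, cycles):
    """Ratio of estimated to true target copies when the efficiencies differ."""
    return ((1.0 + target_efficiency) / (1.0 + standard_efficiency)) ** cycles


# 1000 copies of standard added; target band gives half the standard's signal.
print(estimate_target_copies(1000, 2.5, 5.0))  # 500.0

# A 5-point efficiency mismatch roughly doubles the estimate after 30 cycles.
print(round(efficiency_bias(0.95, 0.90, 30), 1))  # 2.2
```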
Several examples of the application of Q-PCR in monitoring the course of anti-HIV therapies were given illustrating the sensitivity and effectiveness of the assay.
Accuracy of automated DNA sequencing in the core facility
Clayton W. Naeve (St. Jude Children's Research Hospital, Memphis) presented the ABRF Nucleic Acids Committee's DNA sequencing study results. The study, previously summarized in ABRF News (December 1994) and submitted for publication, demonstrated that, in the core facility setting, the dye-primer protocol provided the longest and most accurate reads (400-450 bases on average with post-run editing, 300-350 bases without post-run editing). However, the first 100 bases appeared less reliable. The dye-terminator protocol, while used by 75% of participating facilities, provided shorter reads (275-300 bases on average) and, surprisingly, these reads were not significantly improved with post-run editing. The first 100 bases appeared more reliable using the dye-terminator protocol.
The range of core facility performance also varied by protocol. For the dye-primer protocol, the "best" data set submitted for the study gave 600 bases with greater than 96% accuracy, and the "poorest" provided 300 bases at greater than 98% accuracy. For the dye-terminator protocol, the "best" data set gave 500 bases with greater than 99% accuracy and the "poorest" 300 bases with greater than 99% accuracy.
The distribution of errors was also examined. The errors produced using the dye-primer protocol were largely miscalls and no-calls (N's). Post-run editing corrected both types and extended the length of read. The errors produced using the dye-terminator protocol also consisted largely of miscalls and no-calls. However, while editing corrected most no-calls and fewer miscalls, deletions between bases 300-400 were not corrected and the length of read was not extended.
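The error categories above can be tallied mechanically once a read has been aligned to the known template. The sketch below assumes the two sequences are already aligned to equal length with "-" marking gaps; it is not the committee's analysis software, and the example sequences are invented.

```python
from collections import Counter


def classify_errors(reference, read):
    """Count matches and error types in a pre-aligned reference/read pair."""
    if len(reference) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == "-" and read_base == "-":
            continue                      # gap in both: nothing to score
        if ref_base == "-":
            counts["insertion"] += 1      # extra base in the read
        elif read_base == "-":
            counts["deletion"] += 1       # base missing from the read
        elif read_base == "N":
            counts["no_call"] += 1        # base caller gave no call
        elif read_base != ref_base:
            counts["miscall"] += 1        # wrong base called
        else:
            counts["match"] += 1
    return counts


def accuracy(counts):
    """Fraction of reference bases called correctly."""
    ref_bases = counts["match"] + counts["miscall"] + counts["no_call"] + counts["deletion"]
    return counts["match"] / ref_bases if ref_bases else 0.0


counts = classify_errors("ACGTACGTAC", "ACGNAC-TAT")
print(dict(counts))      # {'match': 7, 'no_call': 1, 'deletion': 1, 'miscall': 1}
print(accuracy(counts))  # 0.7
```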
There were no sequence motifs contributing to poor base calling in the test template; however, the typical factors contributing to errors were observed: low signal strength, loss of resolution, and weak C's following G's.