Summaries of Presentations at ABRF '97: Techniques at the Genome Proteome Interface


Two of the presentations from the ABRF meeting held in Baltimore, MD on February 9-12, 1997 are summarized here. Additional presentations will be summarized in subsequent issues of ABRF News.


Gene Mining: Finding the Gold Buried Within the Human Genome


Richard K. Wilson

Washington University School of Medicine

 

A session at the 1997 ABRF Meeting entitled "High Throughput DNA Sequencing" focused on the impact that large scale sequencing of expressed sequence tags (ESTs) is having on the field of human genomic research and the hunt for human disease genes. ESTs are single pass, unedited nucleotide sequence reads generated from the ends of random cDNA clones. This approach has often been touted as a more effective means than complete genome sequencing for "mining" all the important gene sequences from the human genome. The three speakers, Eric Green, Mark Boguski and Richard Wilson, described various aspects of EST projects and how the data generated might eventually lead to discovery of all human genes and a more comprehensive understanding of human genetics and biology.

 

Since its inception in the late 1980s, the Human Genome Project has had as one of its central goals the identification of all of the 100,000 or so genes buried within the human genome. How best to attain this goal has often been a point of argument. On the one hand, sequencing the genome in its entirety would of course provide the sequence of all genes. Since most human genes are a discontinuous mix of exons, sometimes quite widely spaced, computer analysis of the complete sequence would not be able to immediately identify all of the genes. But as computer tools improved and genome sequence data from other organisms such as yeast, roundworm and mouse became available, more and more genes could be identified within the sequence and accurately annotated. However, it has often been pointed out that only three to five percent of the genome sequence actually encodes the full set of genes. The remainder contains repetitive elements and other non-coding "spacer" DNA which are of considerably less interest to those mainly concerned with identifying all of the genes. Why sequence all of the human genome when only five percent is interesting?

 

With this thought in mind, many have argued that a better alternative to complete genome sequencing is to sequence only the DNA that derives from the genes themselves: complementary DNA (cDNA). cDNA is derived by using a viral enzyme, reverse transcriptase, to convert RNA transcripts of the genes into DNA copies which can be cloned and sequenced using standard molecular methods. It is arguable as to who first conceived of this approach as a means for surveying all human genes, but certainly the first group to put it to work on a large scale was the team of Mark Adams, Tony Kerlavage, Dick McCombie and Craig Venter, working at the NIH in the late 1980s (1). Their approach was to sequence short segments at both ends of random cDNA clones from a variety of tissue types. Since only short sequence reads were attempted from each clone, it was possible to rapidly sample a large number of cDNAs from any particular tissue. Key to their success was the advent of the fluorescence-based DNA sequencing technology that was developed in Leroy Hood's laboratory in the mid-1980s (2). Although the data obtained by the NIH group was incomplete and of low accuracy, it was immediately apparent that this expressed sequence tag (EST) strategy was a very effective means of rapidly identifying and "tagging" human genes.

 

The initial success of the NIH group spawned additional large scale EST projects, both in industry and academia. Biotechnology companies, such as Human Genome Sciences and Incyte, were founded on the power of large scale EST sequencing and its potential value for drug discovery. Many of these large industrial efforts took a proprietary approach to EST data, eventually leading to the Merck-sponsored EST project at Washington University described below (3). The Merck and other academic EST projects have produced an excellent public database of short cDNA tags which currently represent approximately 55,000 unique human genes (4). This database is a powerful resource for the identification and study of human disease genes. It further provides effective tools for mapping, complete sequencing and comprehensive analysis of the human genome.

 

Why sequence all of the human genome? Because for all of its promise, large scale EST sequencing will ultimately provide only part of the picture. Although EST efforts have quickly added clues about new genes to the public database, they are vulnerable to the law of diminishing returns. Since cDNAs are not present in any tissue in equal abundance, it is likely that some rarely expressed genes will be missed. Procedures aimed at normalizing the relative abundance of cDNAs have been developed and have facilitated the identification of some of the more rarely expressed genes, but it is very apparent that many genes are still unrepresented in the database. Furthermore, it is important to remember that the remainder of the human genome, the so-called "junk DNA," is not without worth. As Sydney Brenner once said, "Junk you keep, garbage you throw away." It is these non-coding regions of the genome that contain the elements which control and regulate gene expression and modulate chromosomal structure and evolution. Besides providing the sequences for genes missed by the EST approach, the complete genome sequence will begin to provide the clues as to how genes actually work as part of the larger machinery to create, sustain and modify life.
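The diminishing returns that follow from unequal cDNA abundance can be illustrated with a short calculation. The sketch below is hypothetical: the function name and the abundance values are invented for illustration and are not taken from any EST project. It assumes each sequence read is an independent draw of a clone with probability proportional to transcript abundance, so that a gene with probability p is seen at least once in n reads with probability 1 - (1 - p)^n.

```python
def expected_genes_sampled(abundances, n_reads):
    """Expected number of distinct genes seen after n_reads random clone picks.

    abundances: relative transcript abundances (any positive scale).
    Assumption: each read is an independent draw proportional to abundance.
    """
    total = float(sum(abundances))
    return sum(1.0 - (1.0 - a / total) ** n_reads for a in abundances)


# Hypothetical library: 10 highly abundant genes and 990 rare genes.
skewed = [1000.0] * 10 + [1.0] * 990
# Idealized, fully normalized library: every gene equally represented.
normalized = [1.0] * 1000

# Compare how many of the 1000 genes each library is expected to reveal.
for n in (1000, 5000, 20000):
    print(n,
          round(expected_genes_sampled(skewed, n)),
          round(expected_genes_sampled(normalized, n)))
```

Under these assumptions, each additional batch of reads from the skewed library turns up fewer new genes than the previous batch, and the normalized library reveals more genes for the same sequencing effort. Real libraries are only partially normalized, so the second column should be read as an upper bound rather than a prediction.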

 

At their least, large scale EST projects will help to identify a significant number of novel genes, many of which will be important to the study of human disease. More importantly, EST projects will provide a resource for the comprehensive analysis of the human genome. It is from this vantage point that the session "High Throughput DNA Sequencing" was convened at the 1997 ABRF Meeting. Three speakers who have been involved in various aspects of the Human Genome Project since its beginnings described the generation and usefulness of EST data in their ongoing work. The session organizer, Richard K. Wilson, has worked on the Washington University-Merck EST project as well as on large scale projects to sequence the complete Caenorhabditis elegans genome and human chromosome 7. Mark Boguski, from the National Center for Biotechnology Information in Bethesda, MD, has been a key player in building and analyzing the public EST database. Eric Green, from the National Human Genome Research Institute in Bethesda, has focused on using ESTs as reagents for improving physical maps of the human genome - specifically chromosome 7 - in an attempt to speed the search for disease genes which are not represented in the current EST database.

 

Wilson described the efforts of the Genome Sequencing Center (GSC) at Washington University School of Medicine to "tag sequence" large numbers of human and mouse cDNAs and to place the data into public databases as rapidly as possible. The EST projects at Washington University are sponsored by Merck (human) and the Howard Hughes Medical Institute (mouse). While the best known EST projects have focused on human genes, projects utilizing cDNAs from mouse, roundworm and other model organisms have also been useful for finding novel genes and gaining clues as to possible function. Since the Merck-sponsored project began in 1995, the GSC has submitted nearly 500,000 human ESTs to the database. According to analyses done by Boguski and colleagues, this represents nearly 55,000 individual human genes. A key supplier of cDNA libraries for both the human and mouse EST projects has been Bento Soares of Columbia University. Soares developed a means for partial normalization of cDNA libraries which greatly reduces the representation of abundant transcripts, thereby facilitating the identification of rarely expressed gene transcripts (5). To further improve the chances for identifying novel genes, cDNA libraries were prepared from a variety of tissues.

 

The normalization procedure and the use of multiple tissues improve the rate at which novel genes are sampled, but it is clear that subtraction and other more directed approaches will be needed to add significantly to the collection of unique genes. Thus, Wilson went on to discuss genomic DNA sequencing, which is his group's main focus. Wilson described the effort to sequence the complete 100 million base pair genome of the roundworm C. elegans, which is being done as a collaborative effort between the St. Louis group and the Sanger Centre in England. To date, more than 85 percent of the C. elegans genome is either finished or available via the Internet. Although C. elegans genes are small compared to human genes and thus more readily identified using existing computer tools, questions often remain about where a gene begins or ends and which putative exons are actually included in the final transcript. EST projects utilizing C. elegans cDNA clones and conducted by the St. Louis group and Yuji Kohara of The National Institute of Genetics, Japan have provided important clues to gene structure, organization and expression. Wilson showed several examples of genes within the introns of other genes and alternative splicing patterns. In human genomic sequencing efforts now underway, Wilson indicated that ESTs have similarly been an important and useful tool for analysis and annotation.

The second speaker, Mark Boguski, described work being done by his group at the NCBI in analyzing EST data and using it to form the beginnings of a transcript map of the human genome. Since cDNAs can be directionally cloned, a sequence read from the 5' end provides data from within the coding region and thus information about the protein encoded by the expressed gene. A 3' sequence read provides data from the untranslated tail of the transcript and thus serves as a unique identifier for the gene because it allows differentiation between similar members of gene families. The 5' and 3' EST reads can be "clustered" into distinct assemblies by searching for perfect matches between all sequences in the database. Since the 3' untranslated region is more prone to variation, determination of the minimal set of 3' ESTs provides a good estimation of the actual number of genes represented by an EST collection. For the ESTs currently in the public database, such clustering analysis suggests that approximately 55,000 distinct human genes have been identified. With this minimal set of human genes in hand, others have begun to map the genes across the human genome using the technique of radiation hybrid mapping. The resulting transcript map will speed efforts to pinpoint those genes responsible for human diseases by providing key landmarks on the physical map. Once researchers know the approximate location of a disease locus on a particular chromosome, they can "zoom in" on the genes which have been mapped to the region and begin analysis to determine whether any of these are indeed the culprit. In addition to this work, Boguski went on to discuss how microarrays or "chips" could utilize the minimal set of ESTs to analyze and compare gene expression patterns in various cell types and developmental stages.
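The idea of clustering reads by perfect matches and counting the resulting clusters can be sketched in a few lines. What follows is a toy illustration under stated assumptions, not the NCBI group's actual procedure: the function name cluster_reads, the word length k, and the example sequences are all hypothetical. Two reads are placed in the same cluster if they share at least one exact stretch of k bases, and clusters are merged with a simple union-find structure.

```python
def cluster_reads(reads, k=12):
    """Group reads that share at least one exact k-base match (single linkage).

    Returns a sorted list of clusters, each a list of read indices.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the cluster representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-base word -> index of the first read containing it
    for idx, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            word = seq[pos:pos + k]
            if word in first_seen:
                a, b = find(idx), find(first_seen[word])
                if a != b:
                    parent[a] = b  # merge the two clusters
            else:
                first_seen[word] = idx

    clusters = {}
    for idx in range(len(reads)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Hypothetical 3' sequences from two unrelated transcripts.
gene_a = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
gene_b = "TTTTGGGGCCCCAAAATTGGCCAATTGGCCA"
# Two overlapping reads per transcript.
reads = [gene_a[0:20], gene_a[8:31], gene_b[0:20], gene_b[5:31]]

clusters = cluster_reads(reads, k=8)
print(len(clusters), "clusters:", clusters)
```

The number of clusters serves as the estimate of the number of distinct genes. Because a single shared word is enough to merge two clusters in this sketch, members of a gene family with conserved coding sequence could be merged incorrectly, which is consistent with the preference for the more variable 3' untranslated region described above.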

 

The third speaker, Eric Green, discussed the use of ESTs as powerful reagents for improving the physical map of the human genome. His laboratory at the NHGRI has focused on mapping human chromosome 7, which is approximately 170 Mb in size and contains many genes implicated in human disease. ESTs have provided valuable landmarks to supplement the numerous other sequence-tagged sites (STSs) for aligning YAC clones and constructing YAC contigs. As more ESTs are added to the map, the resolution increases and the map becomes an increasingly powerful tool for finding disease genes. Green's laboratory has taken the additional step of using the technique of direct cDNA selection to derive ESTs specifically from genes located on chromosome 7 (for details see reference 6). In this approach, they use cloned genomic DNA from chromosome 7 as an affinity matrix to "capture" portions of cDNAs from chromosome 7 genes. Green described the results of this approach, which included generating 2000 ESTs enriched for chromosome 7 genes and, to date, mapping 200 to specific locations on chromosome 7, thereby providing a new set of gene-specific landmarks for the physical map.

 

This session made clear that several different approaches will be necessary to effectively mine all of the riches from the human genome. The efforts of all three of the groups represented in this session, along with similar work being done elsewhere, have begun to coalesce along several lines to increase the rate at which these riches have begun to emerge. Large scale EST sequencing, while it has been a springboard for some companies and individuals into gene discovery and quick surveys of gene expression, is proving most valuable as a tool for a more careful and comprehensive whole genome approach.

 

Please see the following World Wide Web sites for EST data and additional information:

 

http://genome.wustl.edu/gsc/gschmpg.html

http://www.ncbi.nlm.nih.gov/

http://www.nhgri.nih.gov/DIR/GTB/CHR

 

References

 

1. M.D. Adams, J.M. Kelley, J.D. Gocayne, M. Dubnick, M.H. Polymeropoulos, H. Xiao, C.R. Merril, A. Wu, B. Olde, R.F. Moreno et al. (1991) Science 252, 1651-1656.

2. L.M. Smith, J.Z. Sanders, R.J. Kaiser, P. Hughes, C. Dodd, C.R. Connell, C. Heiner, S.B. Kent and L.E. Hood. (1986) Nature 321, 674-679.

3. L. Hillier, G. Lennon, M. Becker, M.F. Bonaldo, B. Chiapelli, S. Chissoe, N. Dietrich, T. DuBuque, A. Favello, W. Gish et al. (1996) Genome Res. 6, 807-828.

4. G.D. Schuler, M.S. Boguski, E.A. Stewart, L.D. Stein, G. Gyapay, K. Rice, R.E. White, P. Rodriguez-Tome, A. Aggarwal, E. Bajorek et al. (1996) Science 274, 540-546.

5. M.B. Soares, M.F. Bonaldo, P. Jelene, L. Su, L. Lawton and A. Efstratiadis. (1994) Proc. Natl. Acad. Sci. USA 91, 9228-9232.

6. J.W. Touchman, G.G. Bouffard, L.A. Weintraub, J.R. Idol, L. Wang, C.M. Robbins, J.C. Nussbaum, M. Lovett and E.D. Green. (1997) Genome Res. 7, 281-292.


Richard K. Wilson may be contacted at the Department of Genetics, Washington University School of Medicine, 4566 Scott Avenue, Box 8232, St. Louis, MO, 63110, Tel: (314) 362-7666, Email: rick@geneman.wustl.edu




Created: 13th June 1997
Last modified: 13th June 1997